Created by NIVEDITA RAO
Q. What are patterns?
A. Patterns are sets of items, subsequences or substructures that frequently occur together in a data set. Patterns represent important and intrinsic properties of a data set.

Pattern discovery: uncovering patterns from massive data sets. e.g. What products are often bought together? Targeted marketing, or checking what products are bought after an iPad. Which code segments likely contain copy-and-paste bugs? Which word sequences form phrases? It is important to find such inherent regularities in a data set.

Pattern discovery serves as a foundation for many data mining tasks:
- Association, causality and correlation analysis.
- Mining sequential and structural patterns.
- Pattern analysis in spatiotemporal (belonging to both space and time), multimedia, time-series and stream data.
- Classification: discriminative pattern-based analysis.
- Cluster analysis: pattern-based subspace clustering.

Itemset: a set of one or more items.
k-itemset: an itemset containing k items.
Support (count): frequency of an itemset in a data set. It can be absolute (a count) or relative (a fraction of the transactions).

Frequent patterns: an itemset is frequent if its support passes a certain minimum support threshold.

From frequent itemsets we can derive association rules. An association rule is given as X -> Y (s, c). s is the support: the probability that X and Y occur together in a transaction, i.e. sup(X u Y). c is the confidence: the conditional probability that Y occurs given that X has occurred, so c = sup(X u Y) / sup(X). Support and confidence are measures of rule interestingness; they evaluate the usefulness and the certainty of a discovered rule.

Association rule mining: find all rules X -> Y satisfying minimum support and confidence thresholds. Only a limited number of association rules can be generated from a given transaction data set.

Challenge: we may still generate too many frequent patterns. To handle this situation we have two solutions:

Closed pattern: a pattern X is closed if it is frequent and there exists no super-pattern Y ⊃ X with the same support as X. Closed patterns are a lossless compression of the set of frequent patterns.

Max pattern: a pattern X is a max pattern if it is frequent and there exists no frequent super-pattern Y ⊃ X. The difference from a closed pattern is that support is not mentioned here: a max pattern does not care about the real support of its sub-patterns. Max patterns are therefore a lossy compression, unlike closed patterns: the information about the sub-patterns' support is lost. So closed patterns are more desirable than max patterns.
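A minimal sketch of how support and confidence for a rule X -> Y can be computed; the transaction list and item names here are made up for illustration, not from the notes.

```python
# Toy transaction database (illustrative item names).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]

def support(itemset, db):
    """Relative support: fraction of transactions containing the itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

X, Y = {"milk"}, {"bread"}
s = support(X | Y, transactions)   # s = probability that X and Y occur together
c = s / support(X, transactions)   # c = sup(X u Y) / sup(X)
print(f"rule {X} -> {Y}: support={s:.2f}, confidence={c:.2f}")
```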
Downward closure property (Apriori) of frequent patterns
Apriori property: every subset of a frequent itemset must also be frequent. This gives an efficient mining methodology: if any subset of an itemset S is infrequent, then S cannot be frequent.
Apriori pruning principle: if an itemset is not frequent, its supersets should not even be generated.

We have three mining methods:
1. Level-wise, join-based approach: Apriori
2. Vertical data format: ECLAT
3. Frequent pattern projection and growth: FP-growth

Apriori algorithm: the name comes from the fact that the algorithm uses prior knowledge of the frequent itemsets already found. It is a level-wise, candidate-generation-and-test approach.
Outline of the algorithm: first generate all frequent 1-itemsets. Then loop over the database, generating the length-(k+1) candidate itemsets from the frequent length-k itemsets, counting their supports, and keeping the frequent ones. Increase the length by one and repeat until no more frequent itemsets can be generated. The concrete implementation of candidate generation has two steps (see the sketch below):
1. Self-joining Fk (the frequent k-itemsets): two frequent k-itemsets sharing their first k-1 items are joined to form a (k+1)-item candidate.
2. Pruning: candidates are discarded if any of their k-item subsets is infrequent; a hash table of support counts helps with this check and reduces the candidate set.

Improvements to Apriori:
- Reduce the number of passes through the database: 1. partitioning, 2. dynamic itemset counting.
- Shrink the number of candidates: 1. hashing (interesting for mining max patterns), 2. pruning by support lower bounding, 3. sampling.
- Explore special data structures: 1. tree projection, 2. H-tree, 3. hypercube decomposition. A new tree structure called the FP-tree will be seen later.

Partitioning method: scan the database only twice; this method guarantees that two scans are enough in all cases.
Theorem: any itemset that is frequent in the whole transactional database must be frequent in at least one of the partitioned databases.
1. The first scan partitions the database into k partitions. How to partition? Partition so that each part fits into main memory; then scanning a partition involves no extra database I/O, and all the locally frequent itemsets of each partition can be derived.
2. By the theorem above, any itemset that is frequent in one of the partitions becomes a global candidate; the second scan counts these candidates to find the globally frequent itemsets.

Direct hashing and pruning (DHP): a method to reduce the number of candidates. The observation: a frequent k-itemset must be hashed into a bucket whose count is at least the minimum support, so if a bucket's count is below the threshold, none of the itemsets hashed into it can be frequent. While counting the 1-itemsets, the 2-itemset hash table is set up, and entries can be cut from it by looking at the bucket counts, making the candidate set shorter.

ECLAT: equivalence class transformation, a depth-first search algorithm using set intersection. The original data format is horizontal (transaction -> items); here we transform it into a vertical data format (item -> list of transaction ids), i.e. item and TID are interchanged.
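A minimal sketch of the level-wise Apriori loop described above, assuming a small in-memory transaction list and an absolute min-support count; the candidate join here simply unions pairs of frequent k-itemsets and checks the resulting size, a simplification of the sorted-prefix join.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (frozensets) with absolute support >= min_support."""
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_support}
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        # Self-join: combine frequent k-itemsets into (k+1)-item candidates.
        candidates = set()
        items = list(frequent)
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                union = items[i] | items[j]
                if len(union) == k + 1:
                    # Apriori pruning: every k-subset must itself be frequent.
                    if all(frozenset(sub) in frequent for sub in combinations(union, k)):
                        candidates.add(union)
        # One scan of the database to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
print(apriori(db, min_support=3))
```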
In the vertical format, the TID list of a k-itemset is found by intersecting the TID lists of its components; if the resulting list contains enough transactions the itemset is frequent, otherwise it is infrequent.
Properties of TID lists:
1. If the TID lists of two itemsets are equivalent, the itemsets always occur in the same transactions.
2. If the TID list of one itemset is a subset of another's, then the itemset with the larger TID list occurs in every transaction where the other one occurs.
3. Frequent patterns are derived from these vertical TID-list intersections.
4. Diffsets are used to accelerate the mining.
Diffset: each item can have a huge number of transaction ids associated with it, so the intersections can also be large. Storing the difference from the prefix's TID list instead is a better method, because the diffset is usually smaller and takes up less storage.

FP-growth: mining patterns by pattern growth.
Method: find the frequent single items and partition the database based on each such item; then recursively grow frequent patterns by doing the same for each partitioned database (also called a conditional database). To facilitate efficient processing we use a new data structure called the FP-tree. Mining then becomes: recursively construct and mine the conditional FP-trees, until the resulting FP-tree is empty or contains only a single path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern.

How to construct an FP-tree from a transactional database?
1. Scan the DB once to find the frequent single items.
2. Sort the frequent items into a frequency-descending list.
3. Scan the DB again and insert each (sorted) transaction into the FP-tree; the tree's nodes carry the support counts along their paths.

How to mine this tree? We use divide and conquer, recursively. We can also find the conditional support of patterns, e.g. the support of patterns containing p, or of patterns containing m but not p. From the tree we can read off the transformed prefix paths and their supports; this is the conditional pattern base. For each conditional pattern base we:
- mine the single-item patterns,
- construct its FP-tree and mine it recursively.
When there is a shared single prefix path P, two things happen: the single prefix path is reduced into one node, and the mining results of the two parts are concatenated.

Scaling FP-growth by database projection: used when the database (and its FP-tree) cannot fit into main memory.
- Project the DB based on the single frequent items.
- Then construct and mine an FP-tree for each projected database.
Parallel projection vs. partition projection:
Parallel projection: project the database on each frequent item, i.e. take an item such as f4 and project the transactions onto the items that are at least as frequent as it. This is space-costly, but all partitions can be processed in parallel.
Partition projection: partition the database in order, passing the unprocessed parts on to subsequent partitions so that the items of a single transaction are not replicated; this is not space-costly.

CLOSET+: for mining closed patterns.
Itemset merging: if Y appears in every occurrence of X, then Y is merged with X.
There are many other tricks, such as:
- hybrid tree projection
- bottom-up physical tree projection
- top-down pseudo tree projection
- sub-itemset pruning
- item skipping
- efficient subset checking
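A small sketch of the vertical (TID-list) representation and the diffset idea discussed above; the transactions and item names are invented for the example.

```python
# Horizontal format: transaction id -> items (toy data).
horizontal = {
    10: {"a", "b", "e"},
    20: {"b", "d"},
    30: {"a", "b", "d"},
    40: {"a", "c"},
}

# Transform into vertical format: item -> set of TIDs.
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

# TID list of a 2-itemset = intersection of its members' TID lists;
# its support is the size of that intersection.
tids_ab = vertical["a"] & vertical["b"]
print("sup({a,b}) =", len(tids_ab))

# Diffset idea: store the difference from the prefix's TID list instead of
# the (possibly long) intersection itself.
diffset_ab = vertical["a"] - vertical["b"]
print("diffset =", diffset_ab, "sup({a,b}) =", len(vertical["a"]) - len(diffset_ab))
```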
Pattern mining generates a large set of patterns or rules, and not all of them are interesting.
Interestingness measures: objective vs. subjective.
Objective measures: support, confidence and correlation.
Subjective measures: depend on the user.
1. Query-based: relevant to a particular user.
2. Against one's knowledge base: unexpected, fresh, or timely information.
3. Visualization tools: multidimensional, interactive views that users can explore on their own.

But sometimes the mathematical measures can be misleading, e.g. the classic basketball/cereal example, where a rule with high support and confidence actually hides a negative correlation. So support and confidence are not enough to describe a rule, and we need other interestingness measures:
1. Lift
2. chi^2

Lift: Lift(B, C) = sup(B u C) / (sup(B) x sup(C)) tells how the items are correlated. Lift(B, C) = 1 if B and C are independent, > 1 if they are positively correlated, and < 1 if they are negatively correlated.

chi^2: if it is zero, the items are independent; if it is > 0 they are correlated, either positively or negatively, so an additional test is needed to determine the sign. We can determine it by comparing the observed (o) and expected (e) values: if o < e the correlation is negative, and if o > e it is positive.

But these two measures do not work in every scenario, especially when there are many null transactions (transactions containing neither item); then they can give values that point the wrong way. So we use null-invariant measures instead. There are five such measures, and they give different values under imbalanced conditions, but Kulczynski tends to stay in the middle. This leads to a new concept, the imbalance ratio: IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A u B)).
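A small sketch of lift and the Kulczynski measure from two-item counts; the counts below are invented for illustration.

```python
# Toy contingency counts for items B and C.
n = 10_000      # total transactions
n_b = 1_000     # transactions containing B
n_c = 1_200     # transactions containing C
n_bc = 400      # transactions containing both B and C

p_b, p_c, p_bc = n_b / n, n_c / n, n_bc / n

lift = p_bc / (p_b * p_c)                 # = 1 independent, > 1 positive, < 1 negative
kulc = 0.5 * (n_bc / n_b + n_bc / n_c)    # average of P(C|B) and P(B|C); null-invariant
print(f"lift = {lift:.2f}, Kulczynski = {kulc:.2f}")
```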
Mining multi-level association rules: items form a hierarchy. One option is to set a single minimum support threshold across all levels, but this creates a redundancy problem. So what do we do? We use efficient mining via shared multi-level mining, where the lowest min-support is used to pass down the set of candidates. Sometimes we also need a customized min-support setting for different kinds of items; one method is group-based, individualized min-support.

Mining multi-dimensional association rules: multi-dimensional rules are not from a single dimension; they combine different aspects of the data. Two types of attributes:
1. Categorical attributes: to mine these complex hybrid structures we use data cubes.
2. Quantitative attributes: to mine these we use discretization or clustering methods.

Mining quantitative associations: numerical attributes like age and salary are covered here. Methods to mine them:
1. Static discretization based on predefined concept hierarchies, with data-cube-based aggregation.
2. Dynamic discretization based on the data distribution.
3. Clustering: distance-based association.
4. Deviation analysis.

Mining extraordinary phenomena: when there is a deviation from the norm we would like to know about it. The LHS of the rule is a subset of the population and the RHS is the extraordinary behaviour of that subset. The rule is accepted only after a statistical test confirms it with high confidence. A subrule highlights the extraordinary behaviour of a subset of the population of the super-rule. Efficient methods have been developed to mine these rules (Aumann and Lindell, KDD).

Mining negative correlations: rare patterns vs. negative patterns. A rare pattern has low support but is very interesting. Negative patterns are negatively correlated and are unlikely to happen together. How to find negatively correlated patterns? One definition: sup(A u B) << sup(A) x sup(B). A good definition of negative correlation should also take care of null invariance, which gives the Kulczynski-measure-based definition: if (P(B|A) + P(A|B)) / 2 < ε, where ε is a small threshold representing negative correlation, then A and B are negatively correlated.

Mining compressed patterns: used when there are too many scattered patterns that are not very meaningful individually. We use a pattern distance formula to measure how close two patterns are, and then apply δ-clustering: find a representative pattern P such that every pattern it covers can be expressed by P and has a pattern distance smaller than δ.
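A sketch of one way the pattern distance used in δ-clustering can be computed, assuming it is taken as the Jaccard distance between the supporting transaction sets of the two patterns; the TID sets below are toy data.

```python
def pattern_distance(tids_p1, tids_p2):
    """1 - |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)|; 0 means identical support sets."""
    inter = len(tids_p1 & tids_p2)
    union = len(tids_p1 | tids_p2)
    return 1 - inter / union

t1 = {1, 2, 3, 4, 5}   # transactions containing pattern P1 (illustrative)
t2 = {1, 2, 3, 4}      # transactions containing pattern P2
print(pattern_distance(t1, t2))   # small distance: P2 can be represented by P1
```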
Sequential pattern mining.
Applications: sequences of error occurrences, calling patterns. It can be done on three types of databases: transaction DBs, sequence DBs and time-series DBs. There are two kinds of sequential patterns, gapped and non-gapped:
- Gapped patterns: gaps are allowed within the pattern.
- Non-gapped patterns: gaps are not allowed; every element matters.
Examples: shopping sequences, click streams, and biological sequences.

Sequential pattern mining: given a set of sequences, find the complete set of frequent subsequences satisfying a minimum support threshold. Here each sequence consists of elements, and each element contains items; items within an element are unordered and are listed alphabetically. So what is a subsequence? It is a part of a longer sequence, and it may have gaps between its elements. When we set a support threshold, it means at least that many sequences contain the subsequence.

Algorithm requirements: it should be efficient, scalable, find the complete set, and incorporate various kinds of user-specified constraints. The Apriori property still holds for sequential patterns: if a subsequence s1 is infrequent, none of its supersequences can be frequent.

The representative algorithms:
1. GSP (Generalized Sequential Patterns)
2. Vertical-format-based mining: SPADE
3. Pattern-growth method: PrefixSpan
4. Mining closed sequential patterns: CloSpan
5. Constraint-based sequential pattern mining.

GSP: an Apriori-based sequential pattern mining algorithm. Initially scan for all the frequent singleton sequences, then combine them to generate length-2 candidate sequences. The algorithm works level by level: scan the DB to find the length-k frequent sequences, generate the length-(k+1) candidate sequences from them using the Apriori property, set k = k + 1, and repeat until no more frequent sequences can be found.

SPADE: convert the sequence database into a vertical format and then use an Apriori-style level-wise approach to extract the length-k frequent patterns. It is much the same as GSP, but working on the vertical format.

PrefixSpan: a pattern-growth algorithm based on prefix projection. Prefix: the part at the front of a sequence; suffix: the remainder, used for the prefix-based projection. Method: find all length-1 sequential patterns, then divide the search space and mine each projected database, e.g. the <a>-projected database, the <b>-projected database, and so on (see the sketch below). An item or element of the sequence is taken as the prefix and its suffixes are collected into the projected database. However, physically copying the suffixes can get redundant, so we can use pseudo-projection: keep the original sequence and store a pointer to where the projected suffix starts. This only works when the DB fits in main memory; otherwise it causes many physical calls to the DB, so when it does not fit we can integrate pseudo-projection and physical projection.

CloSpan: two ways to mine closed sequential patterns:
1. Mine all the patterns and then check which ones are closed.
2. Directly mine the closed patterns.
Property: if s1 ⊂ s and their projected databases have the same support (size), the two branches generate the same patterns, so one can be pruned. Based on this, two types of pruning can be used: backward sub-pattern pruning and backward super-pattern pruning.
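A toy sketch of the prefix projection step behind PrefixSpan, simplified to sequences of single items (no itemset elements); the sequences are made up for the example.

```python
sequences = [
    ["a", "b", "c", "b"],
    ["a", "c", "b"],
    ["b", "a", "c"],
]

def project(db, prefix_item):
    """Return the <prefix_item>-projected database: suffixes after the first occurrence."""
    projected = []
    for seq in db:
        if prefix_item in seq:
            idx = seq.index(prefix_item)
            projected.append(seq[idx + 1:])
    return projected

print(project(sequences, "a"))
# [['b', 'c', 'b'], ['c', 'b'], ['c']] -- mining then recurses on this smaller DB
```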
Mining spatial association rules: the rules are again of the form A -> B with supports and confidences, where A and B can involve topological relations, spatial orientations and distance information.
Hierarchy of spatial relationships: close_to is a generalization of near_by, touch, intersect, etc.
Mining uses progressive refinement: first search for the more general relationships, then go for the refined ones. So we can do two-step association mining:
1. Rough spatial computation with a low-cost algorithm, using MBRs (minimum bounding rectangles) or an R-tree for a rough estimate.
2. Detailed spatial computation, applying the finer and more expensive algorithms only to the candidates that survive step 1.

Mining spatial colocation patterns: a group of spatial features or events that frequently occur in the same region are said to be colocated. The same two steps are included: 1. rough estimation with a low-cost algorithm, 2. detailed estimation with a higher-cost algorithm. We can find colocation patterns and relationships using:
1. Participation ratio
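A small sketch of the "rough then detailed" idea: a cheap MBR (minimum bounding rectangle) overlap test prunes pairs of spatial objects before any exact, more expensive geometric check. The rectangles and feature names are illustrative.

```python
def mbr_overlap(a, b):
    """a, b are rectangles (xmin, ymin, xmax, ymax); True if they intersect."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

rects = {
    "park": (0, 0, 10, 10),
    "school": (8, 8, 15, 15),
    "mall": (20, 20, 25, 25),
}
candidates = [("park", "school"), ("park", "mall")]

# Step 1: keep only pairs whose MBRs overlap; step 2 (exact geometry) would
# run only on these survivors.
rough = [p for p in candidates if mbr_overlap(rects[p[0]], rects[p[1]])]
print(rough)   # [('park', 'school')]
```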