Created by Tyson Mcleod
| Question | Answer |
| --- | --- |
| What is clustering? | Finding groups of objects such that objects in a group are similar to one another and different from objects in other groups. |
| Partitional clustering | A division of data objects into non-overlapping subsets (clusters) such that each object is in exactly one subset. |
| Hierarchical clustering | A set of nested clusters organized as a hierarchical tree. |
| Clustering algorithms (3 used in this course) | K-means (partitional), density-based clustering (DBSCAN), and hierarchical clustering. |
| Clustering distinctions: 1. exclusive versus non-exclusive; 2. fuzzy versus non-fuzzy; 3. partial versus complete; 4. heterogeneous versus homogeneous | 1. Non-exclusive: points may belong to multiple clusters. 2. Fuzzy: a point belongs to every cluster with a weight between 0 and 1, and the weights must sum to 1. 3. Partial: only some of the data is clustered. 4. Heterogeneous: clusters of widely different sizes, shapes, and densities. |
| Centroid | (Typically) the mean of the points in the cluster. |
| K-means complexity | O(n*K*I*d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes (a runnable sketch follows the table). |
| Well-separated cluster | Every point in a cluster is closer to every other point in the cluster than to any point not in the cluster. |
| Center-based cluster | Every point in the cluster is closer to the center of the cluster (often the centroid) than to the center of any other cluster. |
| Contiguity-based cluster | A point in a cluster is closer to one or more other points in the cluster than to any point not in the cluster. |
| Density-based cluster | A cluster is a dense region of points separated by low-density regions from other regions of high density. |
| Shared property clusters (conceptual clusters) | Clusters that share some common property or represent a particular concept. |
| Pre-processing and post-processing (clustering) | Pre-processing: normalize the data and eliminate outliers. Post-processing: 1. eliminate small clusters that may represent outliers; 2. split "loose" clusters with high SSE; 3. merge clusters with low SSE. |
| Strengths of hierarchical clustering | * The desired number of clusters can be obtained by cutting the dendrogram. * It can produce meaningful taxonomies (e.g. the animal kingdom). |
| Two main types of hierarchical clustering | Agglomerative and divisive. |
| Cluster similarity - MIN or Single Link | Similarity of two clusters is based on the two most similar (closest) points in the different clusters. |
| Cluster similarity - MAX or Complete Link | Similarity of two clusters is based on the two most different (distant) points in the different clusters. |
| Cluster similarity - Group average | Proximity of two clusters is the average pairwise proximity between points in the two clusters. |
| Cluster similarity - Ward's method | Similarity of two clusters is based on the increase in squared error when the two clusters are merged; less susceptible to noise (see the linkage sketch after the table). |
| DBSCAN - density | Density = the number of points within a specified radius (Eps). |
| DBSCAN - MinPts | A point is a core point if it has more than a specified number of points (MinPts) within Eps. |
| DBSCAN - Border | A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point. |
| DBSCAN - Noise | A noise point is any point that is neither a core point nor a border point. |
| DBSCAN - pros and cons | Pros: finds clusters of arbitrary shape, is robust to noise, and does not need k a priori. Cons: requires connected regions of sufficiently high density, and data sets with varying densities are problematic (see the DBSCAN sketch after the table). |
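
A minimal K-means sketch makes the complexity card concrete: each iteration does the n*K*d distance work, repeated for up to I iterations. This is an illustrative NumPy implementation, not the course's reference code; the function name and parameters are my own.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Illustrative K-means: cost per iteration is O(n*K*d)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):  # at most I iterations
        # Assign each point to its nearest centroid:
        # n*K distance computations, each over d attributes.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged early
        centroids = new_centroids
    return labels, centroids
```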
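The four cluster-similarity cards map directly onto SciPy's linkage methods. A short sketch, assuming SciPy is available; the random toy data is purely for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # toy data, illustration only

# method maps onto the cards above:
#   "single"   = MIN / single link
#   "complete" = MAX / complete link
#   "average"  = group average
#   "ward"     = Ward's method (minimises the increase in squared error on merge)
Z = linkage(X, method="ward")

# Cutting the dendrogram yields the desired number of clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```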
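For the DBSCAN cards, scikit-learn exposes Eps and MinPts as `eps` and `min_samples` (note that scikit-learn counts the point itself toward `min_samples`, a slight variation on the "more than MinPts" wording above). The toy data here is again an assumption for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered noise points.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.2, size=(50, 2)),
    rng.normal(loc=3.0, scale=0.2, size=(50, 2)),
    rng.uniform(low=-2.0, high=5.0, size=(10, 2)),
])

# eps is the radius Eps; min_samples plays the role of MinPts.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

labels = db.labels_  # -1 marks noise points
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True
# Border points are clustered but not core points.
border_mask = (labels != -1) & ~core_mask

print("clusters:", len(set(labels) - {-1}))
print("core:", core_mask.sum(), "border:", border_mask.sum(),
      "noise:", (labels == -1).sum())
```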