Data Mining part 2

Description

Data Mining final review part 2
Kim Graff

Resource summary

Question Answer
Overfitting and Tree Pruning: An induced tree may overfit the training data: too many branches, some of which may reflect anomalies due to noise or outliers, giving poor accuracy for unseen samples.
Two approaches to avoid overfitting: PREPRUNING: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold (it is difficult to choose an appropriate threshold). POSTPRUNING: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees, then use a set of data different from the training data to decide which is the "best pruned tree".
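
Both ideas can be sketched with scikit-learn's decision trees. The snippet below is only an illustration under assumed settings (the bundled Iris data, a min_impurity_decrease threshold standing in for the prepruning goodness measure, and cost-complexity pruning for postpruning), not the exact procedure the card describes.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 / 3, random_state=0)

# Prepruning: refuse splits whose impurity decrease falls below a threshold.
pre = DecisionTreeClassifier(min_impurity_decrease=0.01, random_state=0)
pre.fit(X_train, y_train)

# Postpruning: grow the full tree, then pick the best tree from the sequence of
# progressively pruned trees using data different from the training set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(ccp_alpha=max(a, 0.0), random_state=0).fit(X_train, y_train)
     for a in path.ccp_alphas),  # max(a, 0.0) guards against tiny negative alphas
    key=lambda tree: tree.score(X_test, y_test),
)
print("pre-pruned accuracy :", pre.score(X_test, y_test))
print("post-pruned accuracy:", best.score(X_test, y_test))
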
Naive Bayesian Classifier: Comments
Associative Classification: Association rules are generated and analyzed for use in classification. Search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels.
Associative Classification: Why effective? It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time. In many studies, associative classification has been found to be more accurate than some traditional classification methods, such as C4.5.
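
As an illustration of the idea only (not any particular associative-classification algorithm such as CBA), the hand-rolled sketch below mines single-condition class-association rules of the form attribute=value -> class and keeps those with high support and confidence; the toy records and thresholds are made up.

from collections import Counter

# Hypothetical toy records: (attribute-value dict, class label).
records = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "no-play"),
    ({"outlook": "rain", "windy": "no"}, "play"),
    ({"outlook": "rain", "windy": "yes"}, "no-play"),
    ({"outlook": "sunny", "windy": "no"}, "play"),
]

min_support, min_confidence = 0.2, 0.8
item_counts, rule_counts = Counter(), Counter()
for attrs, label in records:
    for item in attrs.items():
        item_counts[item] += 1           # how often the attribute-value pair occurs
        rule_counts[(item, label)] += 1  # how often it occurs together with the class

n = len(records)
for (item, label), count in rule_counts.items():
    support = count / n
    confidence = count / item_counts[item]
    if support >= min_support and confidence >= min_confidence:
        print(f"{item[0]}={item[1]} -> {label}  (sup={support:.2f}, conf={confidence:.2f})")
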
Classifier Accuracy Measures
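
A small sketch, assuming a binary confusion matrix with counts tp, tn, fp, fn, of the measures usually grouped under this heading: accuracy, error rate, sensitivity (recall), specificity, and precision.

def accuracy_measures(tp, tn, fp, fn):
    # Standard measures computed from binary confusion-matrix counts.
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "error rate": (fp + fn) / total,
        "sensitivity (recall)": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

print(accuracy_measures(tp=90, tn=85, fp=15, fn=10))  # made-up counts
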
Evaluating the Accuracy of a Classifier or Predictor: HOLDOUT METHOD: Given data is randomly partitioned into two independent sets: a TRAINING SET (e.g., 2/3) for model construction and a TEST SET (e.g., 1/3) for accuracy estimation. RANDOM SAMPLING: a variation of holdout; repeat holdout k times, accuracy = avg. of the accuracies obtained.
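
A minimal sketch of both procedures, assuming scikit-learn, its bundled Iris data, and a logistic-regression model chosen only for illustration.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Holdout: 2/3 for model construction, 1/3 for accuracy estimation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 / 3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# Random sampling: repeat holdout k times and average the accuracies.
k = 10
scores = []
for seed in range(k):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1 / 3, random_state=seed)
    scores.append(LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te))
print("random sampling accuracy:", sum(scores) / k)
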
Evaluating the Accuracy of a Classifier or Predictor: CROSS-VALIDATION (k-fold, where k = 10 is most popular): Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size. At the i-th iteration, use Di as the test set and the others as the training set. LEAVE-ONE-OUT: k folds where k = # of tuples, for small data sets. STRATIFIED CROSS-VALIDATION: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data.
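
A short sketch of stratified 10-fold cross-validation and leave-one-out, under the same assumptions as above (scikit-learn, Iris data, logistic regression as a stand-in model).

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Stratified 10-fold: each fold keeps roughly the original class distribution.
cv10 = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold CV accuracy:", cv10.mean())

# Leave-one-out: k folds with k equal to the number of tuples (for small data sets).
loo = cross_val_score(model, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", loo.mean())
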
Ensemble Methods: Increasing the Accuracy: ENSEMBLE METHODS use a combination of models to increase accuracy, combining a series of k learned models M1, M2, ..., Mk with the aim of creating an improved model M*. POPULAR ENSEMBLE METHODS: Bagging: average the prediction over a collection of classifiers. Boosting: weighted vote with a collection of classifiers.
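
A minimal comparison of the two popular methods named above, assuming scikit-learn's BaggingClassifier and AdaBoostClassifier with their default decision-tree base learners; the data set and parameters are illustration choices only.

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Bagging: train k classifiers on bootstrap samples and combine their predictions.
bagging = BaggingClassifier(n_estimators=50, random_state=0)

# Boosting: train classifiers sequentially, reweighting misclassified tuples,
# and combine them with a weighted vote (AdaBoost).
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, "CV accuracy:", cross_val_score(model, X, y, cv=10).mean())
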
An example of K-Means Clustering
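
A minimal NumPy sketch of one k-means (Lloyd's) run: repeatedly assign each point to its nearest centroid, then move each centroid to the mean of its points. The 2-D points, k = 2, and the fixed iteration count are made-up illustration choices, not the worked example from the original card.

import numpy as np

points = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
                   [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
k = 2
rng = np.random.default_rng(0)
centroids = points[rng.choice(len(points), size=k, replace=False)]

for _ in range(10):  # fixed number of iterations for simplicity
    # Assignment step: index of the nearest centroid for every point.
    labels = np.argmin(np.linalg.norm(points[:, None] - centroids[None, :], axis=2), axis=1)
    # Update step: move each centroid to the mean of its assigned points
    # (keep the old centroid if a cluster happens to become empty).
    centroids = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(k)
    ])

print("labels   :", labels)
print("centroids:", centroids)
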
Hierarchical Clustering
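
A short sketch of agglomerative hierarchical clustering, assuming SciPy and made-up 2-D points: build the merge tree with single linkage, then cut it to obtain flat clusters.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

points = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])

# Each row of Z records one merge: (cluster i, cluster j, distance, new size).
Z = linkage(points, method="single")
print(Z)

# Cut the tree so that clusters farther apart than 2.0 stay separate.
print(fcluster(Z, t=2.0, criterion="distance"))
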
Density-Based Clustering: Basic Concepts
Density-reachable
Density-connected
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
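
A minimal DBSCAN sketch, assuming scikit-learn and made-up 2-D points: eps is the neighborhood radius, min_samples the density threshold, and points labeled -1 are noise, i.e. not density-reachable from any core point.

import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([[1.0, 1.0], [1.1, 1.2], [0.9, 1.1],
                   [5.0, 5.0], [5.1, 5.2], [4.9, 5.1],
                   [9.0, 9.0]])  # an isolated point, expected to be labeled noise

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)
print(labels)  # e.g. [0 0 0 1 1 1 -1] under these assumed parameters
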
The Curse of Dimensionality: Data in only one dimension is relatively packed. Adding a dimension stretches the points across that dimension, making them further apart; adding more dimensions makes the points further apart still. High-dimensional data is extremely sparse, and distance measures become meaningless due to equi-distance.
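
The equi-distance effect can be seen numerically. The sketch below (NumPy, random points in the unit hypercube, sample sizes chosen arbitrarily) shows the ratio between the farthest and nearest neighbor of one point shrinking toward 1 as the dimension grows.

import numpy as np

rng = np.random.default_rng(0)
for dim in (1, 2, 10, 100, 1000):
    points = rng.random((200, dim))
    # Distances from the first point to all the others.
    dists = np.linalg.norm(points[0] - points[1:], axis=1)
    print(f"dim={dim:5d}  max/min distance from point 0: {dists.max() / dists.min():.2f}")
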

Similar

Chapter 19 Key Terms
Monica Holloway
Data Warehousing and Mining
i7752068
Insurance Policy Advisor
Sufiah Takeisu
Data Mining Part 1
Kim Graff
Minería de Datos.
Marcos Soledispa
Machine Learning
Alberto Ochoa
Data Mining from Big Data 4V-s
Prohor Leykin
Model Roles
Steve Hiscock
Data Mining Process
Steve Hiscock
Data Mining Tasks
Steve Hiscock
Distribution Types
Steve Hiscock