Data Preprocessing

Description

Data Preprocessing Technique in Data Mining
nithi131993
Mind Map by nithi131993, updated more than 1 year ago
Created by nithi131993 almost 10 years ago

Resource Summary

Data Preprocessing
  1. Why Preprocess the data?
    1. Consistency

      Notes:

      • Reduced representation of the data set.
      • Dimensionality reduction and numerosity reduction.
      1. Data Transformation

        Notes:

        • Normalization, data discretization, and concept hierarchy generation are forms of Data Transformation.
        1. Completeness

          Notes:

          • Integrating multiple databases, data cubes, or files.
          1. Accuracy

            Notes:

            • There are many possible reasons for inaccurate data. For example, a validation rule may require that a person's age be entered as a value no greater than 100.
            1. Believability

              Notes:

              • Believability reflects how much the data are trusted by users.
              1. Interpretability

                Notes:

                • Interpretability reflects how easily the data are understood by users.
              2. Major Tasks in Data Preprocessing
                1. Data Cleaning

                  Notes:

                  • Cleans the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
                  • 1. Missing Values. 2. Noisy Data. 3. Data Cleaning as a Process.
                  1. Missing Values

                    Notes:

                    • 1. Ignore the tuple.
                    • 2. Fill in the missing value manually.
                    • 3. Use a global constant to fill in the missing value.
                    • 4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value.
                    • 5. Use the attribute mean or median for all samples belonging to the same class as the given tuple.
                    • 6. Use the most probable value to fill in the missing value.
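
Several of the strategies listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `records` list, its `(class_label, income)` layout, and all function names are hypothetical, and `None` is assumed to mark a missing value. Filling in the most probable value is not shown because it requires a predictive model (for example regression or a decision tree).

```python
from statistics import mean

# Hypothetical toy data: (class_label, income); None marks a missing value.
records = [
    ("low", 20.0), ("low", None), ("low", 30.0),
    ("high", 90.0), ("high", 110.0), ("high", None),
]

def drop_missing(rows):
    """Ignore (drop) every tuple whose value is missing."""
    return [(c, v) for c, v in rows if v is not None]

def fill_constant(rows, constant):
    """Replace every missing value with one global constant."""
    return [(c, constant if v is None else v) for c, v in rows]

def fill_attribute_mean(rows):
    """Replace missing values with the mean of all observed values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_class_mean(rows):
    """Replace missing values with the mean of the tuple's own class."""
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_means[c] if v is None else v) for c, v in rows]
```

Swapping `mean` for `statistics.median` gives the median-based variants, which are usually preferred when the attribute is skewed.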
                    1. Noisy Data

                      Notes:

                      • 1. Binning -> Smoothing by bin means. -> Smoothing by bin medians. -> Smoothing by bin boundaries.
                      • 2. Regression -> Linear Regression.
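
The three binning-based smoothing variants can be sketched as follows. The sketch assumes equal-frequency bins over sorted values; the function names and the tie-breaking rule (a value equally close to both boundaries goes to the lower one) are choices made for this example, not part of the original notes.

```python
from statistics import mean, median

def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
```

With these nine sorted prices and a bin size of 3, the bins are `[4, 8, 15]`, `[21, 21, 24]`, and `[25, 28, 34]`; each smoothing function then returns bins of the same shape with the values replaced.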
                      1. Data Cleaning as a Process

                        Notes:

                        • Rules Used: -> Unique Rule. -> Consecutive Rule. -> Null Rule.
                        • Tools Used: -> Data Scrubbing Tools. -> Data Auditing Tools. -> Data Migration Tools.
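
As a rough illustration of the three rules, the checks below flag violations in a single column. They assume the usual reading of the rules: a unique rule requires every value to differ from all others, a consecutive rule additionally allows no gaps between the lowest and highest value, and a null rule defines which markers stand for "no value". The function names and the `NULL_MARKERS` set are hypothetical.

```python
# Assumed encodings of "null" for this example; real data sets define their own.
NULL_MARKERS = {None, "", "?", "N/A"}

def unique_rule_violations(values):
    """Return the values that occur more than once."""
    seen, duplicates = set(), set()
    for v in values:
        if v in seen:
            duplicates.add(v)
        seen.add(v)
    return sorted(duplicates)

def consecutive_rule_gaps(values):
    """Return the integers missing between the lowest and highest value."""
    present = set(values)
    return [n for n in range(min(present), max(present) + 1) if n not in present]

def null_rule_positions(values):
    """Return the positions whose value is one of the agreed null markers."""
    return [i for i, v in enumerate(values) if v in NULL_MARKERS]
```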
                      2. Data Integration

                        Notes:

                        • Integrates multiple databases, data cubes, or files.
                        1. Redundancy and Correlation Analysis

                          Notes:

                          • Redundancy can be detected by Correlation Analysis.
                          • 1. Chi-square (χ²) Correlation Test for Nominal Data. 2. Correlation Coefficient for Numeric Data. 3. Covariance of Numeric Data.
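
The three measures can be computed directly from their definitions. The sketch below is illustrative only: the function names are made up for this example, the covariance is the population form (dividing by n), and the chi-square function returns just the test statistic for a contingency table of observed counts, leaving the comparison against a critical value to the reader.

```python
from math import sqrt

def covariance(xs, ys):
    """Population covariance of two equally long numeric attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def pearson(xs, ys):
    """Correlation coefficient: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat
```

A coefficient close to +1 or -1, or a chi-square statistic well above the critical value for the table's degrees of freedom, suggests that one of the two attributes may be redundant.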
                          1. Entity Identification Problem
                            1. Tuple Duplication

                              Notes:

                              • Duplication should also be detected at tuple level.
                              1. Data Value Conflict Detection and Resolution

                                Notes:

                                • Data Integration also involves the detection and resolution of Data Value Conflicts.
                              2. Data Transformation and Data Discretization

                                Notes:

                                • Normalization, data discretization, and concept hierarchy generation are forms of Data Transformation.
                                1. Data Transformation Strategies

                                  Notes:

                                  • 1. Smoothing. 2. Attribute Construction. 3. Aggregation. 4. Normalization. 5. Discretization. 6. Concept Hierarchy Generation for nominal data.
                                  1. Data Transformation by Normalization

                                    Notes:

                                    • -> Min-Max Normalization. -> Z-Score Normalization. -> Decimal Scaling.
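
The three normalization methods named above can be sketched in a few lines each. The function names are assumptions for this example; the z-score version uses the population standard deviation, and none of the functions guards against a constant attribute (which would cause a division by zero in the first two).

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min(values), max(values)] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Centre on the mean and scale by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

Min-max normalization preserves the relationships among the original values but is sensitive to outliers, while z-score normalization is the usual choice when the true minimum and maximum are unknown.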
                                    1. Discretization by Binning

                                      Notes:

                                      • Binning is a top-down splitting technique based on a specified number of bins.
                                      1. Discretization by Histograms

                                        Notes:

                                        • Histogram analysis is an unsupervised discretization technique because it does not use class information.
                                        1. Discretization by Cluster, Decision Tree, and Correlation Analyses
                                          1. Concept Hierarchy Generation for Nominal Data

                                            Notes:

                                            • 1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts.
                                            • 2. Specification of a portion of a hierarchy by explicit data grouping.
                                            • 3. Specification of a set of attributes, but not of their partial ordering.
                                            • 4. Specification of only a partial set of attributes.
                                          2. Data Reduction

                                            Notes:

                                            • Obtains a reduced representation of the data set.
                                            1. Data Reduction Strategies

                                              Notes:

                                              • Dimensionality Reduction. Numerosity Reduction.
                                              • Data Compression. -> Lossless. -> Lossy.
                                              1. Wavelet Transforms

                                                Notes:

                                                • Discrete Wavelet Transform.
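
To make the idea concrete, here is a minimal sketch of a discrete wavelet transform using the Haar wavelet, chosen here only because it is the simplest one. It assumes the input length is a power of two; the function names are hypothetical. Data reduction comes from keeping only the largest coefficients and treating the rest as zero.

```python
from math import sqrt

def haar_step(signal):
    """One level: pairwise scaled sums (smooth) and differences (detail)."""
    smooth = [(signal[i] + signal[i + 1]) / sqrt(2) for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / sqrt(2) for i in range(0, len(signal), 2)]
    return smooth, detail

def haar_dwt(signal):
    """Full transform; len(signal) is assumed to be a power of two."""
    coeffs, current = [], list(signal)
    while len(current) > 1:
        current, detail = haar_step(current)
        coeffs = detail + coeffs  # coarser details go in front of finer ones
    return current + coeffs

def haar_idwt(coeffs):
    """Invert haar_dwt by rebuilding the signal from coarse to fine."""
    current, pos = coeffs[:1], 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(current)]
        pos += len(current)
        rebuilt = []
        for s, d in zip(current, detail):
            rebuilt += [(s + d) / sqrt(2), (s - d) / sqrt(2)]
        current = rebuilt
    return current

def keep_largest(coeffs, k):
    """Lossy reduction: zero out all but the k largest-magnitude coefficients."""
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(ranked[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]
```

Applying `haar_idwt` to the output of `keep_largest` gives an approximation of the original signal; with all coefficients kept, the reconstruction is exact up to floating-point error.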
                                                1. Principal Components Analysis

                                                  Notes:

                                                  • Also called the Karhunen-Loève (K-L) method.
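
The core of the method is to find the direction of greatest variance and project the data onto it. The sketch below finds only the first principal component, and it uses power iteration on the covariance matrix as a simple stand-in for a full eigendecomposition; that choice, the function names, and the fixed iteration count are assumptions made for this example. It also assumes the data has non-zero variance and that the starting vector is not orthogonal to the leading component.

```python
from math import sqrt

def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centred rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    return cov, centred

def first_component(cov, iterations=200):
    """Approximate the leading eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def project(centred, component):
    """Reduce each centred row to a single coordinate along the component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centred]
```

For two perfectly correlated attributes such as `[(1, 2), (2, 4), (3, 6), (4, 8)]`, the single projected coordinate retains all of the variance, so the second dimension can be dropped without loss.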
                                                  1. Attribute Subset Selection

                                                    Notes:

                                                    • Reduces the data set size by removing irrelevant or redundant attributes.
                                                    • -> Stepwise Forward Selection. -> Stepwise Backward Elimination. -> Combination of Forward Selection and Backward Elimination. -> Decision Tree Induction.
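
Forward selection can be sketched as a greedy loop: start from an empty set and repeatedly add the attribute that improves a quality score the most, stopping when no attribute helps. Everything specific below is hypothetical: the `score` callback stands in for whatever evaluation is used in practice (for example a statistical significance test or cross-validated accuracy), and `USEFULNESS` is invented toy data.

```python
def forward_selection(attributes, score):
    """Greedily add the attribute that raises score(subset) the most."""
    selected, remaining = [], list(attributes)
    best = score(selected)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:  # no remaining attribute improves the score
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected

# Invented per-attribute contributions, used only to exercise the loop.
USEFULNESS = {"age": 0.5, "income": 0.3, "zip": 0.0, "id": -0.1}

def toy_score(subset):
    """Toy stand-in for a real evaluation measure."""
    return sum(USEFULNESS[a] for a in subset)

chosen = forward_selection(USEFULNESS, toy_score)
```

Backward elimination is the mirror image: start from the full attribute set and repeatedly remove the attribute whose removal hurts the score the least.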
                                                    1. Regression and Log-Linear Models

                                                      Notes:

                                                      • Linear Regression. Multiple Linear Regression. Log-Linear models.
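
Regression reduces data parametrically: instead of storing every (x, y) pair, only the model parameters are kept. A minimal least-squares fit of a straight line might look like this; the function names are assumptions for the example.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

def predict(model, x):
    """Approximate y for a given x from the stored parameters only."""
    slope, intercept = model
    return slope * x + intercept

model = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
```

Here four data points are replaced by two numbers, and `predict(model, x)` reproduces the original values because the toy data lies exactly on a line; real data would be reproduced only approximately.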
                                                      1. Histograms

                                                        Notes:

                                                        • Histograms use binning to approximate data distributions and are a popular form of data reduction.
                                                        • -> Equal Width. -> Equal Frequency.
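
The difference between the two partitioning rules shows up clearly on skewed data. In this sketch each histogram is returned as a list of `((low, high), count)` buckets, which is all that needs to be stored in place of the raw values; the function names and the sample data are made up for the example.

```python
def equal_width_histogram(values, n_buckets):
    """Buckets that each span the same range of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # The maximum value would land in bucket n_buckets, so clamp it.
        counts[min(int((v - lo) / width), n_buckets - 1)] += 1
    return [((lo + i * width, lo + (i + 1) * width), c)
            for i, c in enumerate(counts)]

def equal_frequency_histogram(values, n_buckets):
    """Buckets that each hold roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    buckets = []
    for i in range(n_buckets):
        chunk = ordered[i * n // n_buckets:(i + 1) * n // n_buckets]
        if chunk:
            buckets.append(((chunk[0], chunk[-1]), len(chunk)))
    return buckets

data = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10]
```

Note that in the equal-frequency version identical values can end up in adjacent buckets, so bucket ranges may touch or overlap at their endpoints.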
                                                        1. Clustering

                                                          Notes:

                                                          • Clustering techniques consider data tuples as objects. Centroid distance, the average distance of each cluster object from the cluster centroid, is an alternative measure of cluster quality.
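
Centroid distance can be computed directly for a single cluster of numeric points. The sketch assumes Euclidean distance and takes centroid distance to be the average distance from each point to the centroid; the cluster diameter (the largest pairwise distance) is included for comparison. The function names are hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def centroid(points):
    """Coordinate-wise mean of the cluster's points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def centroid_distance(points):
    """Average distance of each point from the cluster centroid."""
    c = centroid(points)
    return sum(dist(p, c) for p in points) / len(points)

def diameter(points):
    """Maximum distance between any two points in the cluster."""
    return max(dist(p, q) for p in points for q in points)

cluster = [(0, 0), (2, 0), (0, 2), (2, 2)]
```

For data reduction, each cluster can then be represented by its centroid (plus a quality measure such as the ones above) instead of its individual tuples.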
                                                          1. Sampling

                                                            Notes:

                                                            • Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random data sample.
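
Three common ways of drawing such a sample can be sketched with the standard `random` module: simple random sampling without replacement, simple random sampling with replacement, and stratified sampling. The function names, the `fraction` parameter, and the rule of taking at least one item per stratum are assumptions made for this example.

```python
import random

def srswor(data, s, seed=None):
    """Simple random sample of s items WITHOUT replacement."""
    return random.Random(seed).sample(data, s)

def srswr(data, s, seed=None):
    """Simple random sample of s items WITH replacement (repeats possible)."""
    return random.Random(seed).choices(data, k=s)

def stratified(data, key, fraction, seed=None):
    """Sample the same fraction from every stratum defined by key(item)."""
    rng = random.Random(seed)
    strata = {}
    for item in data:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep small strata represented
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical skewed data set: 90 "young" and 10 "senior" customers.
customers = [("young", i) for i in range(90)] + [("senior", i) for i in range(10)]
```

Stratified sampling is helpful when the data is skewed, because a plain random sample of 10 customers could easily contain no "senior" tuple at all.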
