Data Preprocessing

Description

Data Preprocessing Technique in Data Mining
Mind map by nithi131993, updated more than 1 year ago
Created by nithi131993 almost 10 years ago

Resource summary

Data Preprocessing
  1. Why Preprocess the data?
    1. Consistency

      Note:

      • Reduced representation of the data set.
      • Dimensionality reduction and numerosity reduction.
      1. Data Transformation

        Note:

        • Normalization, data discretization, and concept hierarchy generation are forms of Data Transformation.
        1. Completeness

          Note:

          • Integrating multiple databases, data cubes, or files.
          1. Accuracy

            Note:

            • There are many possible reasons for inaccurate data. For example, a person's age must be entered as a value within 100.
            1. Believability

              Note:

              • Believability reflects how much the data are trusted by users.
              1. Interpretability

                Note:

                • Interpretability reflects how easily the data are understood by users.
              2. Major Tasks in Data Preprocessing
                1. Data Cleaning

                  Note:

                  • Cleans the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
                  • 1. Missing Values. 2. Noisy Data. 3. Data Cleaning as a Process.
                  1. Missing Values

                    Note:

                    • 1. Ignore the tuple.
                    • 2. Fill in the missing value manually.
                    • 3. Use a global constant to fill in the missing value.
                    • 4. Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value.
                    • 5. Use the attribute mean or median for all samples belonging to the same class as the given tuple.
                    • 6. Use the most probable value to fill in the missing value.
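Two of the strategies listed above (filling with a global constant, and filling with the attribute mean of the tuple's class) can be sketched as follows. This is a minimal illustration, assuming tuples are plain dicts and `None` marks a missing value; the function names and the sample records are hypothetical.

```python
from statistics import mean

def fill_with_constant(rows, attr, constant="Unknown"):
    """Replace missing values (None) of attr with one global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]

def fill_with_class_mean(rows, attr, class_attr):
    """Replace missing values of attr with the mean of attr over the
    tuples that share the same class label.
    Assumes every class has at least one known value for attr."""
    known = {}
    for r in rows:
        if r[attr] is not None:
            known.setdefault(r[class_attr], []).append(r[attr])
    class_means = {label: mean(vals) for label, vals in known.items()}
    return [
        {**r, attr: class_means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]

# Hypothetical sample data: income is missing for two customers.
rows = [
    {"income": 30000, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 50000, "risk": "low"},
    {"income": 80000, "risk": "high"},
    {"income": None, "risk": "high"},
]
filled = fill_with_class_mean(rows, "income", "risk")
```

Here the missing income of the "low" tuple becomes the mean of the two known "low" incomes, and the missing "high" income takes the single known "high" value.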
                    1. Noisy Data

                      Note:

                      • 1. Binning -> Smoothing by bin means. -> Smoothing by bin medians. -> Smoothing by bin boundaries.
                      • 2. Regression -> Linear Regression
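The three binning-based smoothing variants can be sketched as below. This is an illustrative example, assuming equal-frequency bins built from the sorted values; the function names and the price list are assumptions made for the sketch.

```python
from statistics import median

def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin minimum and maximum.
    Bins are sorted, so b[0] is the minimum and b[-1] the maximum;
    ties go to the lower boundary."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)            # bin means are 9, 22 and 29
by_boundaries = smooth_by_boundaries(bins)  # first bin becomes [4, 4, 15]
```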
                      1. Data Cleaning as a Process

                        Note:

                        • Rules Used: -> Unique Rule. -> Consecutive Rule. -> Null Rule.
                        • Tools Used: -> Data Scrubbing Tools. -> Data Auditing Tools. -> Data Migration Tools.
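The three rules can be read as simple checks on an attribute's values. The sketch below assumes the usual reading: a unique rule forbids repeated values, a consecutive rule forbids gaps between the lowest and highest value, and a null rule names the markers that stand for a missing value. The function names and the marker set are hypothetical.

```python
def unique_rule_violations(values):
    """Return the values that occur more than once."""
    seen, repeated = set(), set()
    for v in values:
        (repeated if v in seen else seen).add(v)
    return sorted(repeated)

def consecutive_rule_violations(values):
    """Return the integers missing between the lowest and highest value."""
    present = set(values)
    return [v for v in range(min(present), max(present) + 1) if v not in present]

# Hypothetical null rule: these markers all denote "no value recorded".
NULL_MARKERS = {"", "?", "N/A", None}

def null_positions(values, markers=NULL_MARKERS):
    """Return the positions whose value is one of the null markers."""
    return [i for i, v in enumerate(values) if v in markers]

check_numbers = [101, 102, 102, 105]
repeated = unique_rule_violations(check_numbers)      # 102 appears twice
missing = consecutive_rule_violations(check_numbers)  # 103 and 104 are absent
```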
                      2. Data Integration

                        Note:

                        • Integrates multiple databases, data cubes, or files.
                        1. Redundancy and Correlation Analysis

                          Note:

                          • Redundancy can be detected by Correlation Analysis.
                          • 1. Correlation Test for Nominal Data. 2. Correlation Coefficient for Numeric Data. 3. Covariance of Numeric Data.
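The three measures named above can be sketched in plain Python. This is a minimal illustration, assuming the chi-square statistic as the correlation test for nominal data, the Pearson coefficient for numeric data, and the population form of covariance (dividing by n); the function names and sample data are assumptions.

```python
from math import sqrt

def covariance(a, b):
    """Population covariance of two equally long numeric attributes."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

def correlation(a, b):
    """Pearson correlation coefficient: covariance scaled by both standard
    deviations. Undefined (division by zero) if either attribute is constant."""
    return covariance(a, b) / sqrt(covariance(a, a) * covariance(b, b))

def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (count - expected) ** 2 / expected
    return stat

# Two perfectly linearly related numeric attributes.
r = correlation([1, 2, 3, 4], [2, 4, 6, 8])

# Hypothetical 2x2 table of counts for two nominal attributes.
chi2 = chi_square([[250, 200], [50, 1000]])
```

A coefficient near +1 or -1 suggests that one of the two numeric attributes may be redundant; a large chi-square value suggests that the two nominal attributes are not independent.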
                          1. Entity Identification Problem
                            1. Tuple Duplication

                              Note:

                              • Duplication should also be detected at tuple level.
                              1. Data Value Conflict Detection and Resolution

                                Note:

                                • Data Integration also involves the detection and resolution of Data Value Conflicts.
                              2. Data Transformation and Data Discretization

                                Note:

                                • Normalization, data discretization, and concept hierarchy generation are forms of Data Transformation.
                                1. Data Transformation Strategies

                                  Note:

                                  • 1. Smoothing. 2. Attribute Construction. 3. Aggregation. 4. Normalization. 5. Discretization. 6. Concept Hierarchy Generation for nominal data.
                                  1. Data Transformation by Normalization

                                    Note:

                                    • -> Min-Max Normalization. -> Z-Score Normalization. -> Decimal Scaling.
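The three normalization methods can be sketched as follows. This is an illustrative version, assuming the population standard deviation for the z-score and a default target range of [0, 1] for min-max; the function names are hypothetical.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] onto [new_min, new_max].
    Fails with a division by zero if all values are equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Centre on the mean and scale by the population standard deviation.
    Fails with a division by zero if all values are equal."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, where j is the smallest integer such that the
    largest absolute scaled value is below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [12000, 73600, 98000]
scaled = min_max(incomes)                  # 73600 maps to roughly 0.716
shifted = decimal_scaling([-986, 917])     # j = 3, so values are divided by 1000
```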
                                    1. Discretization by Binning

                                      Note:

                                      • Binning is a top-down splitting technique based on a specified number of bins.
                                      1. Discretization by Histograms

                                        Note:

                                        • Histogram analysis is an unsupervised discretization technique because it does not use class information.
                                        1. Discretization by Cluster, Decision Tree, and Correlation Analyses
                                          1. Concept Hierarchy Generation for Nominal Data

                                            Note:

                                            • 1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts.
                                            • 2. Specification of a portion of a hierarchy by explicit data grouping.
                                            • 3. Specification of a set of attributes, but not of their partial ordering.
                                            • 4. Specification of only a partial set of attributes.
                                          2. Data Reduction

                                            Note:

                                            • Obtains a reduced representation of the data set.
                                            1. Data Reduction Strategies

                                              Note:

                                              • Dimensionality Reduction. Numerosity Reduction.
                                              • Data Compression. -> Lossless. -> Lossy.
                                              1. Wavelet Transforms

                                                Note:

                                                • Discrete Wavelet Transform.
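As a rough illustration of a discrete wavelet transform, the sketch below uses the Haar wavelet in its simplest unnormalized form (pairwise averages and half-differences). The choice of Haar, the function name, and the sample signal are assumptions; the input length must be a power of two.

```python
def haar_dwt(signal):
    """Full Haar decomposition of a signal whose length is a power of two.
    Returns the overall average followed by detail coefficients,
    coarsest level first."""
    data = list(signal)
    details = []
    while len(data) > 1:
        pairs = range(0, len(data), 2)
        averages = [(data[i] + data[i + 1]) / 2 for i in pairs]
        differences = [(data[i] - data[i + 1]) / 2 for i in pairs]
        details = differences + details  # finer levels stay at the end
        data = averages
    return data + details

coefficients = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])
```

Data reduction then comes from storing only the strongest coefficients, for example those above a chosen threshold, and treating the rest as zero.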
                                                1. Principal Components Analysis

                                                  Note:

                                                  • Also called the Karhunen-Loève method.
                                                  1. Attribute Subset Selection

                                                    Note:

                                                    • Reduces the data set size by removing irrelevant or redundant attributes.
                                                    • -> Stepwise Forward Selection. -> Stepwise Backward Elimination. -> Combination of both. -> Decision Tree Induction.
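Forward selection can be sketched as a greedy loop that starts from an empty set and keeps adding the attribute that improves a score the most. This is a minimal sketch: the `score` callback, the stopping threshold `min_gain`, and the toy weights are all hypothetical stand-ins for a real evaluation measure.

```python
def forward_selection(attributes, score, min_gain=0.0):
    """Greedily add the attribute with the best score until no remaining
    attribute improves the score by more than min_gain."""
    selected = []
    remaining = list(attributes)
    best = score(selected)
    while remaining:
        candidates = [(score(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(candidates, key=lambda c: c[0])
        if top_score - best <= min_gain:
            break
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected

# Toy score: each attribute contributes a fixed, made-up amount of relevance.
weights = {"A1": 0.5, "A2": 0.0, "A3": 0.0, "A4": 0.3, "A5": 0.0, "A6": 0.2}

def toy_score(subset):
    return sum(weights[a] for a in subset)

reduced = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score)
```

Backward elimination would run the same loop in reverse, starting from the full attribute set and dropping the attribute whose removal hurts the score least.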
                                                    1. Regression and Log-Linear Models

                                                      Note:

                                                      • Linear Regression. Multiple Linear Regression. Log-Linear models.
                                                      1. Histograms

                                                        Note:

                                                        • Histograms use binning to approximate data distributions and are a popular form of data reduction.
                                                        • -> Equal Width. -> Equal Frequency.
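The two partitioning rules can be compared on the same data. The sketch below is illustrative only: equal-width buckets share the same value range, while equal-frequency buckets hold roughly the same number of values. The function names and the sample values are assumptions.

```python
def equal_width_histogram(values, n_buckets):
    """Split [min, max] into n_buckets ranges of equal width and count
    the values in each. Fails if all values are equal (zero width)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # The maximum value is placed in the last bucket.
        index = min(int((v - lo) / width), n_buckets - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_buckets + 1)]
    return edges, counts

def equal_frequency_buckets(values, n_buckets):
    """Split the sorted values into n_buckets groups of near-equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [
        ordered[i * n // n_buckets:(i + 1) * n // n_buckets]
        for i in range(n_buckets)
    ]

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
edges, counts = equal_width_histogram(values, 3)
groups = equal_frequency_buckets(values, 3)
```

With a single large value in the data, the equal-width counts are heavily skewed toward the first bucket, whereas the equal-frequency groups stay balanced.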
                                                        1. Clustering

                                                          Note:

                                                          • Clustering techniques consider data tuples as objects. Centroid distance is an alternative measure of cluster quality.
                                                          1. Sampling

                                                            Note:

                                                            • Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random data sample.
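A simple random sample can be drawn with or without replacement. The sketch below uses Python's standard `random` module; the function names and the `seed` parameter (added so that a run can be repeated) are assumptions.

```python
import random

def sample_without_replacement(rows, s, seed=None):
    """Draw s distinct tuples; each tuple can be chosen at most once."""
    return random.Random(seed).sample(rows, s)

def sample_with_replacement(rows, s, seed=None):
    """Draw s tuples; a tuple may be chosen more than once."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(s)]

rows = list(range(1000))  # stand-in for a large data set of 1000 tuples
small = sample_without_replacement(rows, 10, seed=42)
```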