
Data Preprocessing

Description

Mind Map on Data Preprocessing, created by gopalsamybca on 19/02/2015.

Resource summary

Data Preprocessing
  1. Data Cleaning

    Annotations:

    • Real-world data tend to be incomplete, noisy, and inconsistent. Data cleaning (or data cleansing) routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
    1. Missing Values

      Annotations:

      • Imagine that you need to analyze AllElectronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income.
      1. Ignore the tuple

        Annotations:

        • This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective, unless the tuple contains several attributes with missing values.
        1. Fill in the missing value manually

          Annotations:

          • In general, this approach is time-consuming and may not be feasible given a large data set with many missing values
          1. Use a global constant to fill in the missing value

            Annotations:

            • Replace all missing attribute values by the same constant, such as a label like “Unknown” or −∞ (minus infinity).
            1. Use the attribute mean to fill in the missing value

              Annotations:

              • For example, suppose that the average income of AllElectronics customers is $56,000. Use this value to replace the missing value for income.
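
              As an illustration (not part of the original mind map), a minimal pandas sketch of mean imputation; the DataFrame df and the column name income are assumptions for the example.

              import pandas as pd

              df = pd.DataFrame({"income": [45000, None, 61000, None, 52000]})

              # Replace every missing income with the attribute mean.
              df["income"] = df["income"].fillna(df["income"].mean())
              print(df)
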
              1. Use the attribute mean for all samples belonging to the same class as the given tuple

                Annotations:

                • For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
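
                A hedged pandas sketch of the same idea, assuming an illustrative credit_risk class column alongside income:

                import pandas as pd

                df = pd.DataFrame({
                    "credit_risk": ["low", "low", "high", "high", "high"],
                    "income": [50000, None, 30000, 28000, None],
                })

                # Fill each missing income with the mean income of that tuple's credit-risk class.
                df["income"] = df["income"].fillna(
                    df.groupby("credit_risk")["income"].transform("mean")
                )
                print(df)
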
                1. Use the most probable value to fill in the missing value

                  Annotations:

                  • This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
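
                  A minimal scikit-learn sketch, assuming age and credit_limit are the other attributes used to predict the missing income (all column names and values are illustrative):

                  import pandas as pd
                  from sklearn.tree import DecisionTreeRegressor

                  df = pd.DataFrame({
                      "age": [25, 32, 41, 38, 29],
                      "credit_limit": [2000, 5000, 9000, 7000, 3000],
                      "income": [30000, 52000, None, 68000, 41000],
                  })

                  known = df[df["income"].notna()]
                  missing = df[df["income"].isna()]

                  # Learn income from the other attributes, then predict the missing values.
                  model = DecisionTreeRegressor().fit(known[["age", "credit_limit"]], known["income"])
                  df.loc[missing.index, "income"] = model.predict(missing[["age", "credit_limit"]])
                  print(df)
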
              2. Noisy Data

                Annotations:

                • “What is noise?” Noise is a random error or variance in a measured variable. Given a numerical attribute such as, say, price, how can we “smooth” out the data to remove the noise? Let’s look at the following data smoothing techniques.
                1. Binning

                  Annotations:

                  • Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it. The sorted values are distributed into a number of “buckets,” or bins.
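
                  For instance, smoothing by bin means could be sketched as follows (a small illustration; the sorted price values and the bin depth of 3 are assumptions):

                  import numpy as np

                  prices = np.sort(np.array([4, 8, 15, 21, 21, 24, 25, 28, 34]))

                  # Partition the sorted values into equal-frequency bins of depth 3,
                  # then replace each value by its bin mean.
                  bins = prices.reshape(-1, 3)
                  smoothed = np.repeat(bins.mean(axis=1), 3)
                  print(smoothed)   # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
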
                  1. Regression

                    Annotations:

                    • Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the “best” line to fit two attributes (or variables), so that one attribute can be used to predict the other
                    1. Clustering

                      Annotations:

                      • Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.” Intuitively, values that fall outside of the set of clusters may be considered outliers (Figure 2.12). Chapter 7 is dedicated to the topic of clustering and outlier analysis.
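
                      A hedged illustration with scikit-learn's KMeans: values far from every cluster centroid are flagged as potential outliers (the data and the threshold are assumptions for the example):

                      import numpy as np
                      from sklearn.cluster import KMeans

                      values = np.array([[22], [25], [24], [80], [51], [53], [49], [23]])

                      km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
                      # Distance of each value to its nearest cluster centroid.
                      dist = km.transform(values).min(axis=1)

                      # Flag values that lie unusually far from every cluster.
                      outliers = values[dist > 2 * dist.mean()].ravel()
                      print(outliers)   # [80]
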
                    2. Data Cleaning as a Process

                      Annotations:

                      • Missing values, noise, and inconsistencies contribute to inaccurate data. So far, we have looked at techniques for handling missing data and for smoothing data. “But data cleaning is a big job. What about data cleaning as a process? How exactly does one proceed in tackling this task?”
                      1. Data Integration and Transformation

                        Annotations:

                        • Data mining often requires data integration—the merging of data from multiple data stores. The data may also need to be transformed into forms appropriate for mining. This section describes both data integration and data transformation.
                        1. Data Integration

                          Annotations:

                          • It is likely that your data analysis task will involve data integration, which combines data from multiple sources into a coherent data store, as in data warehousing. These sources may include multiple databases, data cubes, or flat files.
                          1. Data Transformation

                            Annotations:

                            • Smoothing, which works to remove noise from the data; such techniques include binning, regression, and clustering. Aggregation, where summary or aggregation operations are applied to the data; for example, the daily sales data may be aggregated so as to compute monthly and annual total amounts. This step is typically used in constructing a data cube for analysis of the data at multiple granularities. Generalization of the data, where low-level or “primitive” (raw) data are replaced by higher-level concepts through the use of concept hierarchies; for example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country. Similarly, values for numerical attributes, like age, may be mapped to higher-level concepts, like youth, middle-aged, and senior. Normalization, where the attribute data are scaled so as to fall within a small specified range, such as −1.0 to 1.0, or 0.0 to 1.0.
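
                            A brief sketch of two of these transformations, daily-to-monthly aggregation and min-max normalization to [0.0, 1.0]; the date and sales columns are assumptions for the example:

                            import pandas as pd

                            daily = pd.DataFrame({
                                "date": pd.date_range("2004-01-01", periods=90, freq="D"),
                                "sales": range(100, 190),
                            })

                            # Aggregation: roll daily sales up to monthly totals.
                            monthly = daily.resample("MS", on="date")["sales"].sum()

                            # Normalization: rescale sales to fall within [0.0, 1.0] (min-max normalization).
                            s = daily["sales"]
                            daily["sales_norm"] = (s - s.min()) / (s.max() - s.min())

                            print(monthly)
                            print(daily[["sales", "sales_norm"]].head())
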
                          2. Data Reduction

                            Annotations:

                            • Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results.
                            1. Data Cube Aggregation

                              Annotations:

                              • Imagine that you have collected the data for your analysis. These data consist of the AllElectronics sales per quarter, for the years 2002 to 2004. You are, however, interested in the annual sales (total per year), rather than the total per quarter
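
                              For example (an illustrative pandas sketch; the figures are made up, not AllElectronics data), quarterly sales can be rolled up to annual totals:

                              import pandas as pd

                              quarterly = pd.DataFrame({
                                  "year": [2002] * 4 + [2003] * 4 + [2004] * 4,
                                  "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
                                  "sales": [224, 408, 350, 586, 301, 412, 380, 594, 310, 425, 390, 600],
                              })

                              # Aggregate away the quarter dimension to obtain one total per year.
                              annual = quarterly.groupby("year")["sales"].sum()
                              print(annual)
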
                              1. Attribute Subset Selection

                                Annotations:

                                • Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant. For example, if the task is to classify customers as to whether or not they are likely to purchase a popular new CD at AllElectronics when notified of a sale, attributes such as the customer’s telephone number are likely to be irrelevant, unlike attributes such as age or music taste
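
                                One possible illustration is a filter-style selection with scikit-learn (this is not the stepwise procedure described in the text, and all column names and values are made up):

                                import pandas as pd
                                from sklearn.feature_selection import SelectKBest, mutual_info_classif

                                X = pd.DataFrame({
                                    "age": [18, 22, 35, 41, 19, 55, 30, 27],
                                    "phone_digit_sum": [31, 44, 29, 38, 41, 27, 33, 36],   # an irrelevant attribute
                                    "likes_pop_music": [1, 1, 0, 0, 1, 0, 0, 1],
                                })
                                y = [1, 1, 0, 0, 1, 0, 0, 1]   # buys the new CD?

                                # Keep the two attributes most informative about the class label.
                                selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
                                print(X.columns[selector.get_support()].tolist())
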
                                1. Dimensionality Reduction

                                  Annotations:

                                  • In dimensionality reduction, data encoding or transformations are applied so as to obtain a reduced or “compressed” representation of the original data. If the original data can be reconstructed from the compressed data without any loss of information, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy. There are several well-tuned algorithms for string compression.
                                  1. Principal Components Analysis

                                    Annotations:

                                    • In this subsection we provide an intuitive introduction to principal components analysis as a method of dimensionality reduction. A detailed theoretical explanation is beyond the scope of this book. Suppose that the data to be reduced consist of tuples or data vectors described by n attributes or dimensions. Principal components analysis, or PCA (also called the Karhunen-Loeve, or K-L, method), searches for k n-dimensional orthogonal vectors that can best be used to represent the data, where k ≤ n. The original data are thus projected onto a much smaller space, resulting in dimensionality reduction. Unlike attribute subset selection, which reduces the attribute set size by retaining a subset of the initial set of attributes, PCA “combines” the essence of attributes by creating an alternative, smaller set of variables. The initial data can then be projected onto this smaller set. PCA often reveals relationships that were not previously suspected and thereby allows interpretations that would not ordinarily result.
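
                                    A minimal scikit-learn sketch, projecting made-up tuples with n = 4 attributes onto k = 2 principal components:

                                    import numpy as np
                                    from sklearn.decomposition import PCA

                                    rng = np.random.default_rng(0)
                                    X = rng.normal(size=(100, 4))          # 100 tuples, n = 4 attributes

                                    # Project onto k = 2 orthogonal principal components (k <= n).
                                    pca = PCA(n_components=2)
                                    X_reduced = pca.fit_transform(X)

                                    print(X_reduced.shape)                 # (100, 2)
                                    print(pca.explained_variance_ratio_)   # variance captured by each component
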
                                    1. Numerosity Reduction

                                      Annotations:

                                      • “Can we reduce the data volume by choosing alternative, ‘smaller’ forms of data representation?” Techniques of numerosity reduction can indeed be applied for this purpose. These techniques may be parametric or nonparametric. For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data.
                                      1. Regression and Log-Linear Models

                                        Annotations:

                                        • Regression and log-linear models can be used to approximate the given data. In (simple) linear regression, the data are modeled to fit a straight line. For example, a random variable, y (called a response variable), can be modeled as a linear function of another random variable, x (called a predictor variable), with the equation y = wx + b.
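
                                        A small numpy sketch of fitting w and b by least squares, so that only these two parameters need to be stored in place of the data points (the x and y values are made up):

                                        import numpy as np

                                        x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
                                        y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

                                        # Least-squares fit of y = w*x + b; keep only w and b instead of the tuples.
                                        w, b = np.polyfit(x, y, deg=1)
                                        print(w, b)   # roughly w = 1.96, b = 0.14
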
                                        1. Histograms

                                          Annotations:

                                          • Histograms use binning to approximate data distributions and are a popular form of data reduction. Histograms were introduced in Section 2.2.3. A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, or buckets. If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets. Often, buckets instead represent continuous ranges for the given attribute.
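
                                          For instance, an equal-width histogram of made-up price values with numpy:

                                          import numpy as np

                                          price = np.array([1, 1, 5, 5, 5, 8, 10, 10, 12, 14, 15, 15, 18, 20, 21, 25, 28, 30])

                                          # Partition the value range of price into 3 equal-width buckets of width 10.
                                          counts, edges = np.histogram(price, bins=3, range=(0, 30))
                                          for lo, hi, c in zip(edges[:-1], edges[1:], counts):
                                              print(f"bucket {lo:.0f}-{hi:.0f}: {c} values")
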
                                          1. Clustering

                                            Annotations:

                                            • Clustering techniques consider data tuples as objects. They partition the objects into groups or clusters, so that objects within a cluster are “similar” to one another and “dissimilar” to objects in other clusters. Similarity is commonly defined in terms of how “close” the objects are in space, based on a distance function. The “quality” of a cluster may be represented by its diameter, the maximum distance between any two objects in the cluster. Centroid distance is an alternative measure of cluster quality and is defined as the average distance of each cluster object from the cluster centroid (denoting the “average object,” or average point in space for the cluster).
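
                                            A short numpy sketch of the two quality measures named above, computed for one made-up cluster of 2-D points:

                                            import numpy as np
                                            from scipy.spatial.distance import pdist

                                            cluster = np.array([[1.0, 2.0], [2.0, 1.5], [1.5, 2.5], [2.2, 2.1]])

                                            # Diameter: maximum distance between any two objects in the cluster.
                                            diameter = pdist(cluster).max()

                                            # Centroid distance: average distance of each object from the cluster centroid.
                                            centroid = cluster.mean(axis=0)
                                            centroid_distance = np.linalg.norm(cluster - centroid, axis=1).mean()

                                            print(diameter, centroid_distance)
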
                                            1. Sampling

                                              Annotations:

                                              • Sampling can be used as a data reduction technique because it allows a large data set to be represented by a much smaller random sample (or subset) of the data. Suppose that a large data set, D, contains N tuples. Let’s look at the most common ways that we could sample D for data reduction
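
                                              For example, simple random sampling without and with replacement in pandas (the data set D is simulated here):

                                              import pandas as pd

                                              D = pd.DataFrame({"value": range(1000)})   # N = 1000 tuples

                                              # Simple random sample WITHOUT replacement (SRSWOR) of s = 50 tuples.
                                              srswor = D.sample(n=50, replace=False, random_state=1)

                                              # Simple random sample WITH replacement (SRSWR): a tuple may be drawn more than once.
                                              srswr = D.sample(n=50, replace=True, random_state=1)

                                              print(len(srswor), len(srswr))
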
                                          2. Data Discretization and Concept Hierarchy Generation

                                            Annotations:

                                              • Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values. Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data.
                                            1. Discretization and Concept Hierarchy Generation for Numerical Data

                                              Annotations:

                                                • It is difficult and laborious to specify concept hierarchies for numerical attributes because of the wide diversity of possible data ranges and the frequent updates of data values. Such manual specification can also be quite arbitrary. Concept hierarchies for numerical attributes can be constructed automatically based on data discretization. We examine the following methods: binning, histogram analysis, entropy-based discretization, χ²-merging, cluster analysis, and discretization by intuitive partitioning. In general, each method assumes that the values to be discretized are sorted in ascending order.
                                              1. Binning

                                                Annotations:

                                                • Binning is a top-down splitting technique based on a specified number of bins. Section 2.3.2 discussed binning methods for data smoothing. These methods are also used as discretization methods for numerosity reduction and concept hierarchy generation
                                                1. Histogram Analysis

                                                  Annotations:

                                                    • Like binning, histogram analysis is an unsupervised discretization technique because it does not use class information. Histograms partition the values for an attribute, A, into disjoint ranges called buckets. Histograms were introduced in Section 2.2.3. Partitioning rules for defining histograms were described in Section 2.5.4. In an equal-width histogram, for example, the values are partitioned into equal-size partitions or ranges.
                                                  1. Entropy-Based Discretization

                                                    Annotations:

                                                    • Entropy is one of the most commonly used discretization measures. It was first introduced by Claude Shannon in pioneering work on information theory and the concept of information gain. Entropy-based discretization is a supervised, top-down splitting technique. It explores class distribution information in its calculation and determination of split-points (data values for partitioning an attribute range). To discretize a numerical attribute, A, the method selects the value of A that has the minimum entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization. Such discretization forms a concept hierarchy for A.
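
                                                      A compact sketch of picking the split-point with minimum expected class entropy (plain numpy; the toy values of A and their class labels are assumptions):

                                                      import numpy as np

                                                      values = np.array([1, 2, 3, 8, 9, 10])
                                                      labels = np.array(["no", "no", "no", "yes", "yes", "yes"])

                                                      def entropy(y):
                                                          _, counts = np.unique(y, return_counts=True)
                                                          p = counts / counts.sum()
                                                          return -(p * np.log2(p)).sum()

                                                      # Try each midpoint between consecutive sorted values; keep the split
                                                      # whose two partitions have the lowest expected (weighted) entropy.
                                                      best = None
                                                      for split in (values[:-1] + values[1:]) / 2:
                                                          left, right = labels[values <= split], labels[values > split]
                                                          exp_ent = (len(left) * entropy(left) + len(right) * entropy(right)) / len(values)
                                                          if best is None or exp_ent < best[1]:
                                                              best = (split, exp_ent)

                                                      print(best)   # split-point 5.5 with expected entropy 0.0
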
                                                      1. Interval Merging by χ² Analysis

                                                      Annotations:

                                                        • ChiMerge is a χ²-based discretization method. The discretization methods that we have studied up to this point have all employed a top-down, splitting strategy. This contrasts with ChiMerge, which employs a bottom-up approach by finding the best neighboring intervals and then merging these to form larger intervals, recursively. The method is supervised in that it uses class information.
                                                      1. Cluster Analysis

                                                        Annotations:

                                                        • Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to discretize a numerical attribute, A, by partitioning the values of A into clusters or groups.
                                                        1. Discretization by Intuitive Partitioning

                                                          Annotations:

                                                            • Many users would like to see numerical ranges partitioned into relatively uniform, easy-to-read intervals that appear intuitive or “natural.” For example, annual salaries broken into ranges like ($50,000, $60,000] are often more desirable than ranges like ($51,263.98, $60,872.34] obtained by, say, some sophisticated clustering analysis. The 3-4-5 rule can be used to segment numerical data into relatively uniform, natural-seeming intervals. In general, the rule partitions a given range of data into 3, 4, or 5 relatively equal-width intervals, recursively and level by level, based on the value range at the most significant digit.
                                                        2. Concept Hierarchy Generation for Categorical Data

                                                          Annotations:

                                                            • Categorical data are discrete data. Categorical attributes have a finite (but possibly large) number of distinct values, with no ordering among the values. Examples include geographic location, job category, and item type. There are several methods for the generation of concept hierarchies for categorical data.
                                                          1. Specification of a partial ordering of attributes explicitly at the schema level by users or experts:

                                                            Annotations:

                                                            • Concept hierarchies for categorical attributes or dimensions typically involve a group of attributes. A user or expert can easily define a concept hierarchy by specifying a partial or total ordering of the attributes at the schema level. For example, a relational database or a dimension location of a data warehouse may contain the following group of attributes: street, city, province or state, and country. A hierarchy can be defined by specifying the total ordering among these attributes at the schema level, such as street < city < province or state < country
                                                              1. Specification of a set of attributes, but not of their partial ordering

                                                              Annotations:

                                                                • A user may specify a set of attributes forming a concept hierarchy, but omit to explicitly state their partial ordering. The system can then try to automatically generate the attribute ordering so as to construct a meaningful concept hierarchy. “Without knowledge of data semantics, how can a hierarchical ordering for an arbitrary set of categorical attributes be found?” Consider the observation that, since higher-level concepts generally cover several subordinate lower-level concepts, an attribute defining a high concept level (e.g., country) will usually contain a smaller number of distinct values than an attribute defining a lower concept level (e.g., street). Based on this observation, a concept hierarchy can be automatically generated from the number of distinct values per attribute in the given attribute set.
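
                                                                A minimal pandas sketch of that heuristic: order the attributes by their number of distinct values, fewest first (the location data are made up):

                                                                import pandas as pd

                                                                df = pd.DataFrame({
                                                                    "country": ["CA", "CA", "US", "US", "US", "CA"],
                                                                    "province_or_state": ["ON", "BC", "NY", "NY", "IL", "ON"],
                                                                    "city": ["Toronto", "Vancouver", "New York", "Buffalo", "Chicago", "Ottawa"],
                                                                    "street": ["Main St", "Oak Ave", "5th Ave", "Elm St", "Lake Dr", "Bank St"],
                                                                })

                                                                # Fewer distinct values -> higher level in the concept hierarchy.
                                                                ordering = df.nunique().sort_values().index.tolist()
                                                                print(" < ".join(reversed(ordering)))   # street < city < province_or_state < country
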