Chapter 3 Key Terms

Term	Definition
alternative hypothesis	The opposite of the null hypothesis: the claim that the hypothesized relationship or difference does exist, typically the result the analyst expects to find.
Benford’s law	The principle that in any large, randomly produced set of natural numbers, there is an expected distribution of the first, or leading, digit with 1 being the most common, 2 the next most, and down successively to the number 9.
causal modeling	A data approach similar to regression, but used to test for cause-and-effect relationships between multiple variables.
classification	A data approach that attempts to assign each unit in a population into a few categories potentially to help with predictions.
clustering	A data approach that attempts to divide individuals (like customers) into groups (or clusters) in a useful or meaningful way.
co-occurrence grouping	A data approach that attempts to discover associations between individuals based on transactions involving them.
data reduction	A data approach that attempts to reduce the amount of information that needs to be considered to focus on the most critical items (e.g., highest cost, highest risk, largest impact, etc.).
decision boundaries	Technique used to mark the split between one class and another.
decision support system	An information system that supports decision-making activity within a business by combining data and expertise to solve problems and perform calculations.
decision tree	Tool used to divide data into smaller groups.
descriptive analytics	Procedures that summarize existing data to determine what has happened in the past. Some examples include summary statistics (e.g., Count, Min, Max, Average, Median), distributions, and proportions.
diagnostic analytics	Procedures that explore the current data to determine why something has happened the way it has, typically comparing the data to a benchmark. As an example, these allow users to drill down in the data and see how they compare to a budget, a competitor, or a trend.
digital dashboard	An interactive report showing the most important metrics to help users understand how a company or an organization is performing. Often created using Excel or Tableau.
dummy variable	A numerical value (0 or 1) to represent categorical data in statistical analysis; values assigned a 1 indicate the presence of something and 0 represents the absence.
effect size	Used in addition to statistical significance in statistical testing; effect size demonstrates the magnitude of the difference between groups.
interquartile range (IQR)	A measure of variability. To calculate the IQR, the data are first sorted and divided into four equal parts (quartiles); the IQR is the range covered by the middle two quartiles that surround the median, that is, the third quartile minus the first quartile (Q3 - Q1).
link prediction	A data approach that attempts to predict a relationship between two data items.
null hypothesis	An assumption that the hypothesized relationship does not exist, or that there is no significant difference between two samples or populations.
overfitting	A modeling error when the derived model too closely fits a limited set of data points.
predictive analytics	Procedures used to generate a model that can be used to determine what is likely to happen in the future. Examples include regression analysis, forecasting, classification, and other predictive modeling.
prescriptive analytics	Procedures that work to identify the best possible options given constraints or changing conditions. These typically include developing more advanced machine learning and artificial intelligence models to recommend a course of action, or optimizing, based on constraints and/or changing conditions.
profiling	A data approach that attempts to characterize the “typical” behavior of an individual, group, or population by generating summary statistics about the data (including mean, standard deviations, etc.).
regression	A data approach that attempts to estimate or predict, for each unit, the numerical value of some variable using some type of statistical model.
similarity matching	A data approach that attempts to identify similar individuals based on data known about them.
structured data	Data that are organized and reside in a fixed field within a record or a file. Such data are generally contained in a relational database or spreadsheet and are readily searchable by search algorithms.
summary statistics	Describe the location, spread, shape, and dependence of a set of observations. These commonly include the count, sum, minimum, maximum, mean or average, standard deviation, median, quartiles, correlation, covariance, and frequency that describe a specific measurable value.
supervised approach/method	Approach used to learn more about the basic relationships between independent and dependent variables that are hypothesized to exist.
support vector machines	A discriminative classifier defined by a separating hyperplane; it works first to find the hyperplane with the widest margin (or biggest pipe) between the classes.
test data	A set of data used to assess the degree and strength of a predicted relationship established by the analysis of training data.
time series analysis	A predictive analytics technique used to predict future values based on past values of the same variable.
training data	Existing data that have been manually evaluated and assigned a class, which assists in classifying the test data.
underfitting	A modeling error when the derived model is too simple and poorly fits the data points, missing the underlying pattern.
unsupervised approach/method	Approach used for data exploration looking for potential patterns of interest.
XBRL (eXtensible Business Reporting Language)	A global standard for exchanging financial reporting information that uses XML.
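As an illustration of Benford's law, the expected share of each leading digit d is commonly written as log10(1 + 1/d). The sketch below computes that expected distribution and compares it with the leading digits of a sample. The sample (powers of 2) and the function names are assumptions chosen for illustration, not part of the key terms.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(n):
    """First (leading) digit of a nonzero integer."""
    return int(str(abs(n))[0])


def observed_shares(values):
    """Observed share of each leading digit 1-9, ignoring zeros."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Hypothetical sample: the first 200 powers of 2 (illustration only)
sample = [2 ** k for k in range(1, 201)]
expected = benford_expected()
observed = observed_shares(sample)

for d in range(1, 10):
    print(f"{d}: expected {expected[d]:.3f}, observed {observed[d]:.3f}")
```

In practice an analyst would replace the sample with real values such as invoice amounts and look for digits whose observed share is far from the expected share.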
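The interquartile range can be sketched in a few lines with Python's standard library. The data values and the function name below are hypothetical, and the "inclusive" quartile method is one of several conventions; other tools may give slightly different quartiles for the same data.

```python
import statistics


def interquartile_range(values):
    """Return Q3 - Q1 using the 'inclusive' quartile convention."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical transaction amounts; 30 is an outlier
amounts = [2, 4, 4, 5, 7, 8, 9, 11, 30]
print(interquartile_range(amounts))  # 5.0
```

Because the IQR only looks at the middle half of the data, the outlier of 30 does not change the result.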
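A dummy variable can be illustrated with a minimal sketch that turns a categorical column into 0/1 values. The column contents and the helper name `to_dummy` are assumptions made up for this example.

```python
def to_dummy(values, present):
    """Return 1 where the value equals `present` and 0 otherwise."""
    return [1 if v == present else 0 for v in values]


# Hypothetical categorical data: how each invoice was paid
payment_methods = ["card", "cash", "card", "wire"]
is_card = to_dummy(payment_methods, "card")
print(is_card)  # [1, 0, 1, 0]
```

The resulting 0/1 column can then be used as a numeric input to a statistical model such as a regression.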

	Created by Matthew Evans almost 2 years ago