0% found this document useful (0 votes)
5 views19 pages

OLAP in Descriptive Analytics

Uploaded by

adifa mahmud
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views19 pages

OLAP in Descriptive Analytics

Uploaded by

adifa mahmud
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Nature of Data

Data is a collection of facts – usually obtained as the result of experiences, observations,


or experiments. Data are the main ingredient for any BI, data science, and business
analytics initiative. The raw material for what popular decision technologies produce—
information, insight, and knowledge. Data is the lowest level of abstraction (from which
information and knowledge are derived). Data quality and data integrity is critical to
analytics.

Metrics for Analytics Ready Data


▪ Data source reliability: The originality and appropriateness of the storage medium
where the data are obtained.
▪ Data content accuracy: Data are correct and are a good match for the analytics
problem.
▪ Data accessibility: The data are easily and readily obtainable.
▪ Data security and data privacy: Data security means that the data are secured to
allow only those people who have the authority and the need to access them and to
prevent anyone else from reaching them.
▪ Data richness: All required data elements are included in the data set. It also means
that the information content is complete (or near complete) to build a predictive
and/or prescriptive analytics model.
▪ Data consistency: The data are accurately collected and combined/merged.
▪ Data currency/data timeliness: The data should be up-to-date (or as recent/new as
they need to be) for a given analytics model.
▪ Data granularity: This requires that the variables and data values be defined at the
lowest (or as low as required) level of detail for the intended use of the data.
▪ Data validity: This is the term used to describe a match/mismatch between the
actual and expected data values of a given variable.
▪ Data relevancy: The variables in the data set are all relevant to the study being
conducted.

A Simple Taxonomy of Data


Data (datum in singular form) refers to a collection of facts usually obtained as the result of
experiments, observations, transactions, or experiences. Data can consist of numbers,
letters, words, images, voice recordings, and so on, as measurements of a set of variables
(characteristics of the subject or event that we are interested in studying). Data are often
viewed as the lowest level of abstraction from which information and then knowledge is
derived.
➢ At the highest level of abstraction, one can classify data as structured and
unstructured (or semi structured). Unstructured data/semi structured data are
composed of any combination of textual, imagery, voice, and Web content.
Structured data are what data mining algorithms use and can be classified as
categorical or numeric.
Unstructured/textual data – Targeted for humans to process/digest.
Structured data – Targeted for computers to process.
➢ The categorical data can be subdivided into nominal or ordinal data,
whereas numeric data can be subdivided into intervals or ratios.
➢ Nominal data contain measurements of simple codes assigned to objects as labels,
which are not measurements. For example, the variable marital status can be
generally categorized as (1) single, (2) married, and (3) divorced.
➢ Ordinal data contain codes assigned to objects or events as labels that also
represent the rank order among them. For example, the variable credit score can be
generally categorized as (1) low, (2) medium, or (3) high.
➢ Interval data are variables that can be measured on interval scales. A common
example of interval scale measurement is temperature on the Celsius scale.
➢ Ratio data include measurement variables commonly found in the physical sciences
and engineering. Mass, length, time, plane angle, energy, and electric charge are
examples of physical measures that are ratio scales.
Data Preprocessing Steps
A Summary of Data Preprocessing Tasks and Potential Methods

Statistical Modelling for Business Analytics


Descriptive analytics has two main branches: statistics and online analytics processing
(OLAP). OLAP is the term used for analyzing, characterizing, and summarizing structured
data stored in organizational databases (often stored in a data warehouse or in a data
mart) using cubes (i.e., multidimensional data structures that are created to extract a
subset of data values to answer a specific business question). The OLAP branch of
descriptive analytics has also been called business intelligence.
Statistics – A collection of mathematical techniques to characterize and interpret data.
Descriptive Statistics – Describing the data (as it is).
Inferential statistics – Drawing inferences about the population based on a sample data.
Relationship between Statistics and Descriptive Analytics

Descriptive Statistics for Descriptive Analytics


Descriptive statistics describes the basic characteristics of the data at hand, often one
variable at a time. Using formulas and numerical aggregations, descriptive statistics
summarizes the data in such a way that often meaningful and easily understandable
patterns emerge from the study. In business analytics, descriptive statistics plays a critical
role—it allows us to understand and explain/present our data in a meaningful manner
using aggregated numbers, data tables, or charts/graphs. In essence, descriptive statistics
helps us convert our numbers and symbols into meaningful representations for anyone to
understand and use. Such an understanding helps not only business users in their
decision-making processes but also analytics professionals and data scientists to
characterize and validate the data for other more sophisticated analytics tasks.
The methods in descriptive statistics can be classified as either measures for central
tendency or measures of dispersion.
Measures of Centrality Tendency (Also Called Measures
of Location or Centrality)
Measures of centrality are the mathematical methods by which we estimate or describe
central positioning of a given variable of interest. A measure of central tendency is a single
numerical value that aims to describe a set of data by simply identifying or estimating the
central position within the data.
Arithmetic Mean or Simple Average
The arithmetic mean (or simply mean or average) is the sum of all the values/observations
divided by the number of observations in the data set. It is by far the most popular and
most commonly used measure of central tendency. It is used with continuous or discrete
numeric data.

Median (The Number in the Middle)


The median is the measure of center value in a given data set. It is the number in the
middle of a given set of data that has been arranged/sorted in order of magnitude (either
ascending or descending). If the number of observations is an odd number, identifying the
median is very easy—just sort the observations based on their values and pick the value
right in the middle. If the number of observations is an even number, identify the two
middle values, and then take the simple average of these two values. The median is
meaningful and calculable for ratio, interval, and ordinal data types. Once determined, one-
half of the data points in the data is above and the other half is below the median. In
contrary to the mean, the median is not affected by outliers or skewed data.
Mode
The mode is the observation that occurs most frequently (the most frequent value in our
data set). The mode is most useful for data sets that contain a relatively small number of
unique values. Although it is a useful measure (especially for nominal data), mode is not a
very good representation of centrality, and therefore, it should not be used as the only
measure of central tendency for a given data set.
Which central tendency measure is the best?
— Use the mean when the data are not prone to outliers and there is no significant level of
skewness.
— Use the median when the data have outliers and/or it is ordinal in nature.
— Use the mode when the data are nominal.

Perhaps the best practice is to use all three together so that the central tendency of the
data set can be captured and represented from three perspectives.
It is a better practice to present descriptive statistics as a package—a combination of
centrality and dispersion measures—as opposed to a single measure such as mean.

Measures of Dispersion (Also Called Measures of Spread or


Decentrality)
Measures of dispersion are the mathematical methods used to estimate or describe the
degree of variation in a given variable of interest. They represent the numerical spread
(compactness or lack thereof) of a given data set.
Range
The range is perhaps the simplest measure of dispersion. It is the difference between the
largest and the smallest values in a given data set (i.e., variables). So, we calculate range
by simply identifying the smallest value in the data set (minimum), identifying the largest
value in the data set (maximum), and calculating the difference between them (range =
maximum- minimum).
Variance
A more comprehensive and sophisticated measure of dispersion is the variance. It is a
method used to calculate the deviation of all data points in a given data set from the
mean. To prevent the offsetting of negative and positive differences, the variance takes
into account the square of the distances from the mean. The formula for a data sample can
be written as
where n is the number of samples, x is the mean of the sample, and xi is the ith value in the
data set. The larger values of variance indicate more dispersion, whereas smaller values
indicate compression in the overall data set.
Standard Deviation
The standard deviation is also a measure of the spread of values within a set of data. The
standard deviation is calculated by simply taking the square root of the variations. The
following formula shows the calculation of standard deviation from a given sample of data
points.

Mean Absolute Deviation (MAD)


It is a simpler way to calculate the overall deviation from the mean. Specifically, the mean
absolute deviation is calculated by measuring the absolute values of the differences
between each data point and the mean and then summing them. This process provides a
measure of spread without being specific about the data point being lower or higher than
the mean. The following formula shows the calculation of the mean absolute deviation:

Quartiles and Interquartile Range


A quartile is a quarter of the number of data points given in a data set. Quartiles are
determined by first sorting the data and then splitting the sorted data into four disjoint
smaller data sets.
A common way of expressing quartiles is as an interquartile range, which describes the
difference between the third quartile (Q3) and the first quartile (Q1), telling us about the
range of the middle half of the scores in the distribution.
Box-and-Whiskers Plot
The box-and-whiskers plot (or simply a box plot) is a graphical illustration of several
descriptive statistics about a given data set. They can be either horizontal or vertical, but
vertical is the most common representation, especially in modern-day analytics software
products. The box plot shows the centrality (median and sometimes also mean) as well as
the dispersion (the density of the data within the middle half—drawn as a box between the
first and third quartiles), the minimum and maximum ranges (shown as extended lines from
the box, looking like whiskers, that are calculated as 1.5 times the upper or lower end of
the quartile box), and the outliers that are larger than the limits of the whiskers.
Understanding the Specifics about Box-and-Whiskers Plots

Shape of a Distribution
Distribution is the frequency of data points counted and plotted over a small number of
class labels or numerical ranges. A very well-known distribution is called normal
distribution, which is perfectly symmetric on both sides of the mean and has numerous
well-founded mathematical properties that make it a very useful tool for research and
practice.
There are two commonly used measures to calculate the shape characteristics of a
distribution: skewness and kurtosis.
A histogram (frequency plot) is often used to visually illustrate both skewness and kurtosis.
Skewness is a measure of asymmetry (sway) in a distribution of the data that portrays a
unimodal structure—only one peak exists in the distribution of the data. Because normal
distribution is a perfectly symmetric unimodal distribution, it does not have skewness; that
is, its skewness measure (i.e., the value of the coefficient of skewness) is equal to zero. The
skewness measure/value can be either positive or negative. If the distribution sways left
(i.e., the tail is on the right side and the mean is smaller than median), then it produces a
positive skewness measure; if the distribution sways right (i.e., the tail is on the left side
and the mean is larger than median), then it produces a negative skewness measure.

Kurtosis is another measure to use in characterizing the shape of a unimodal distribution.


As opposed to the sway in shape, kurtosis focuses more on characterizing the
peak/tall/skinny nature of the distribution. Specifically, kurtosis measures the degree to
which a distribution is more or less peaked than a normal distribution. Whereas a positive
kurtosis indicates a relatively peaked/tall distribution, a negative kurtosis indicates a
relatively flat/short distribution. As a reference point, a normal distribution has a kurtosis
of 3. The formula for kurtosis can be written as

Regression Modeling for Inferential Statistics


Regression, especially linear regression, is perhaps the most widely known and used
analytics technique in statistics. It is a part of inferential statistics. It is used to characterize
relationship between explanatory (input) and response (output) variable. It can be used
for—
Hypothesis testing (explanation)—investigating potential relationships between different
variables.
Forecasting (prediction)—estimating values of a response variable based on one or more
explanatory variables.
CORRELATION VERSUS REGRESSION
Correlation gives an estimate on the degree of association between the variables.
Regression attempts to describe the dependence of a response variable on one (or more)
explanatory variables.
Correlation is interested in the low-level relationships between two variables
Regression is concerned with the relationships between all explanatory variables and the
response variable.

SIMPLE VERSUS MULTIPLE REGRESSION


If the regression equation is built between one response variable and one explanatory
variable, then it is called simple regression.
Multiple regression is the extension of simple regression when the explanatory variables
are more than one.
In both cases, the relationship between the response variable and the explanatory
variable(s) is linear and additive in nature.

How do we develop linear regression models?


The simplest thing that one can do is to draw a
Scatter Plot where the y-axis represents the values of the response variable and the x-axis
represents the values of the explanatory variable. Simple regression analysis aims to find a
mathematical representation of this relationship.
Ordinary Least Squares (OLS) method aims to minimize the sum of squared residuals
(squared vertical distances between the observation and the regression point) and leads to
a mathematical expression for the estimated value of the regression line (which are known
as β parameters). For simple linear regression, the aforementioned relationship between
the response variable (y) and the explanatory variable(s) (x) can be shown as a simple
equation as follows:
y = β0 + β1x
In this equation, β0 is called the intercept, and β1 is called the slope.

How Do We Know If the Model Is Good Enough?


For the numerical assessment, three statistical measures are often used in evaluating the
fit of a regression model: R2(R- squared), the overall F-test, and the Root Mean Square
Error (RMSE). All three of these measures are based on the sums of the square errors (how
far the data are from the mean and how far the data are from the model’s predicted
values). Of the three, R2 has the most useful and understandable meaning because of its
intuitive scale.

A Process Flow for Developing Regression Models

Regression Modelling Assumptions


1. Linearity: This assumption states that the relationship between the response
variable and the explanatory variables are linear.
2. Independence (of errors): This assumption states that the errors of the response
variable are uncorrelated with each other.
3. Normality (of errors): This assumption states that the errors of the response variable
are normally distributed.
4. Constant variance (of errors): This assumption, also called homoscedasticity, states
that the response variables have the same variance in their error regardless of the
values of the explanatory variables.
5. Multicollinearity: This assumption states that the explanatory variables are not
correlated (i.e., do not replicate the same but provide a different perspective of the
information needed for the model).

Logistic Regression Modelling


A very popular, statistically sound, probability-based classification algorithm that
employs supervised learning. It is a complement to linear regression and linear
discriminant analysis methods. Logistic regression differs from linear regression with
one major point: its output (response variable) is a class as opposed to a numerical
variable.
If there is only one predictor variable and one predicted variable, the method is called
simple logistic regression (similar to calling linear regression models with only one
independent variable simple linear regression).
The Logistic Function

The logistic function, is the core of logistic regression, which can take values only
between 0 and 1. The following equation is a simple mathematical representation of
this function:
Time Series Forecasting
A time series is a sequence of data points of the variable of interest, measured and
represented at successive points in time spaced at uniform time intervals. Examples of time
series include monthly rain volumes in a geographic area, the daily closing value of the
stock market indexes.
Time-series forecasting is the use of mathematical modeling to predict future values of the
variable of interest based on previously observed values.
The time-series plots/charts look and feel very similar to simple linear regression.
In time series there are two variables: the response variable and the time variable
presented in a scatter plot. Beyond this appearance similarity, there is hardly any other
commonality between the two.

Business Reporting
Information is essentially the contextualization of data. Decision makers need information
to make accurate and timely decisions. The information is usually provided to decision
makers in the form of a written report (digital or on paper), or orally.
A report is any communication artifact prepared with the specific intention of conveying
information in a digestible form to whoever needs it whenever and wherever.
It might be organized in a narrative, graphic, and/or tabular form, prepared periodically
(recurring) or on an as-needed (ad hoc) basis, referring to specific time periods, events,
occurrences, or subjects.

Business Report: A written document that contains information regarding business matters.
➢ Purpose: To improve managerial decisions
➢ Source: Data from inside and outside the organization (via the use of ETL—
Extract, Transform, Load)
➢ Format: text, tables, graphs/charts
➢ Distribution: in-print, email, portal/intranet
Data acquisition → Information generation → Decision making → Process management
❖ Different functions of Business Report:
✓ To ensure proper departmental functioning
✓ To provide information
✓ To provide the results of an analysis
✓ To persuade others to act
✓ To create an organizational memory (as part of a knowledge management
✓ system).
Types of Business Reports
Metric Management Reports: Help manage business performance through metrics (SLAs—
Service Level Agreement for externals; KPIs—Key Performance Indicator for internals).
Can be used as part of Six Sigma and/or TQM.
Dashboard-Type Reports: Graphical presentation of several performance indicators in a
single page using dials/gauges.
Balanced Scorecard–Type Reports: Include financial, customer, business process, and
learning & growth indicators.

Data Visualization
The use of visual representations to explore, make sense of, and communicate data.
Since, what is portrayed in visualizations is the information (aggregation, summarization,
and contextualization of data), not the data, the two terms data visualization and
information visualization are used interchangeably and synonymously.
Related to information graphics, scientific visualization, and statistical graphics. Includes
charts, graphs, illustrations.

A Brief History of Data Visualization


Data visualization can date back to the second century AD. Most developments have
occurred in the last two and a half centuries. William Playfair is widely credited as the
inventor of the modern chart, having created the first line and pie charts.
Until recently it was not recognized as a discipline.
Today’s most popular visual forms date back a few centuries.

Types of Charts and Graphs

LINE CHART
The line chart is perhaps the most frequently used graphical visuals for time-series data.
Line charts (or line graphs) show the relationship between two variables.
BAR CHART
The bar chart is among the most basic visuals used for data representation. They are
effective when you have nominal data or numerical data that split nicely into different
categories so you can quickly see comparative results and trends within your data.
PIE CHART
The pie chart is visually appealing, as the name implies, pie-looking charts. Because they
are so visually attractive, they are often incorrectly used. Pie charts should be used only to
illustrate relative proportions of a specific measure.
SCATTER PLOT
The scatter plot is often used to explore the relationship between two or three variables (in
2D or 3D visuals). Because scatter plots are visual exploration tools, translating more than
three variables into more than three dimensions is not easily achievable.
BUBBLE CHART
The bubble chart is often an enhanced version of scatter plots. Bubble charts, though, are
not a new visualization type; instead, they should be viewed as a technique to enrich data
illustrated in scatter plots (or even geographic maps).

Specialized Charts and Graphs


HISTOGRAM
A histogram looks just like a bar chart. Histograms are used to show the frequency
distribution of one variable or several variables.
GANTT CHART
A Gantt chart is a special case of horizontal bar charts used to portray project timelines,
project tasks/activity durations, and overlap among the tasks/activities.
PERT CHART (Program Evaluation and Review Technique)
The PERT chart (also called a network diagram) is developed primarily to simplify the
planning and scheduling of large and complex projects. A PERT chart shows precedence
relationships among project activities/tasks. It is composed of nodes (represented as
circles or rectangles) and edges (represented with directed arrows).
GEOGRAPHIC MAP
When the data set includes any kind of location data (e.g., physical addresses, postal
codes, state names or abbreviations, country names, latitude/longitude, or some type of
custom geographic encoding), it is better and more informative to see the data on a map.
Maps usually are used in conjunction with other charts and graphs rather than by
themselves.
BULLET
A bullet graph is often used to show progress toward a goal. This graph is essentially a
variation of a bar chart.
HEAT MAP
The heat map is a great visual to illustrate the comparison of continuous values across two
categories using color. The goal is to help the user quickly see where the intersection of
the categories is strongest and weakest in terms of numerical values of the measure being
analyzed.
HIGHLIGHT TABLE
The highlight table is intended to take heat maps one step further. In addition to showing
how data intersect by using color, highlight tables add a number on top to provide
additional detail.
TREE MAP
A tree map displays hierarchical (tree-structured) data as a set of nested rectangles. Each
branch of the tree is given a rectangle, which is then tiled with smaller rectangles
representing subbranches. A leaf node’s rectangle has an area proportional to a specified
dimension on the data.

Which Chart or Graph Should You Use?


Visual Analytics
A recently coined term that is often used loosely to mean nothing more than information
visualization. What is meant by visual analytics is the combination of visualization and
predictive analytics. Whereas information visualization is aimed at answering “What
happened?” and “What is happening?” and is closely associated with BI (routine reports,
scorecards, and dashboards), visual analytics is aimed at answering “Why is it happening?”
and “What is more likely to happen?” and is usually associated with business analytics
(forecasting, segmentation, correlation analysis). There is a strong move toward visual
analytics.

INFORMATION DASHBOARDS/PERFORMANCE DASHBOARDS


Dashboards provide visual displays of important information that is consolidated and
arranged on a single screen so that the information can be digested at a single glance and
easily drilled in and further explored.
Information dashboards are common components of most, if not all, BI or business
analytics platforms, business performance management systems, and performance
measurement software suites.

Dashboard Design
The fundamental challenge of dashboard design is to display all the required information
on a single screen, clearly and without distraction, in a manner that can be assimilated
quickly. The most distinctive feature of a dashboard is its three layers of information:
1. Monitoring: Graphical, abstracted data to monitor key performance metrics.
2. Analysis: Summarized dimensional data to analyze the root cause of problems.
3. Management: Detailed operational data that identify what actions to take to resolve
a problem.

What to look for in a dashboard


– Use of visual components to highlight data and exceptions that require action
– Transparent to the user, meaning that they require minimal training and are extremely
easy to use
– Combine data from a variety of systems into a single, summarized, unified view of the
business
– Enable drill-down or drill-through to underlying data sources or reports
– Present a dynamic, real-world view with timely data– Require little coding to implement,
deploy, and maintain

Best Practices in Dashboard Design


o Benchmark KPIs with Industry Standards
o Wrap the Metrics with Contextual Metadata
o Validate the Design by a Usability Specialist
o Prioritize and Rank Alerts and Exceptions
o Enrich Dashboard with Business-User Comments
o Present Information in Three Different Levels
o Pick the Right Visual Constructs
o Provide for Guided Analytics

You might also like