DSILYTC

Statistics
- a scientific study that deals with the collection, analysis, and interpretation of data
- also refers to numerical facts such as averages, medians, percentages, and maximums that help us understand a variety of business and economic situations

Terms
1. Population - the totality of objects under investigation
2. Sample - a subset of a population
3. Parameter - a numerical description of a population
4. Statistical inference - the use of data from a sample to make estimates and test hypotheses about the characteristics of the population
Ex. a 2016 survey found that 50% of millennials plan to stay at their current job for more than a year

Data and data sets
- Data are the facts and figures collected, analyzed, and summarized for presentation and interpretation
- All the data collected in a particular study are referred to as the data set for the study

Elements, Variables, and Observations
1. Elements - the entities on which data are collected
2. Variable - a characteristic of interest for the elements
3. Observation - the set of measurements obtained for a particular element
- A data set with n elements contains n observations
- The total number of data values in a complete data set is the number of elements multiplied by the number of variables

Division of Statistics
1. Descriptive statistics - consists of methods for organizing, displaying, and describing data by using tables, graphs, and summary measures
2. Inferential statistics - consists of methods that use sample results to help make predictions

Kinds of variable
1. Independent - any controlling data
2. Dependent - any data that is affected by the controlling data
a. Moderating variable - affects the relationship between a predictor variable and an outcome variable
b. Mediating variable - tells us how or why it works; an intervening variable which explains the relationship between a predictor variable and the criterion variable
Example: to predict the effect of sunlight on the growth of a certain plant

Data - a compilation of facts and figures or other contents, both numerical and non-numerical
Information - data that have been organized, analyzed, and processed in a meaningful and purposeful way
Knowledge - derived from a blend of data

Collection of data / Types of data
1. Cross-sectional data
- Collected by recording a characteristic of many subjects at the same point in time

2. Time series data
- Collected over several time periods, focusing on certain groups of people, specific events, or objects
- Hourly, daily, weekly, monthly, quarterly, or annual observations

Level of measurement
- The scale determines the amount of information contained in the data
- The scale indicates the data summarization and statistical analyses that are most appropriate
1. Nominal scale
- data are categorized using names, labels, or qualities and cannot be arranged in any particular order
- a nonnumerical label or numeric code may be used
2. Ordinal scale
- data can be arranged in order or rank
- the data have the properties of nominal data and the order or rank of the data is meaningful
3. Interval scale
- has a unit of measurement that permits us to describe how much more or less one subject possesses than another. A zero entry simply represents a position on a scale
- the data have the properties of ordinal data, and the interval between observations is expressed in terms of a fixed unit of measure
- interval data are always numeric
4. Ratio scale
- modified interval level. A zero entry is an inherent zero
- the data have all the properties of interval data, and the ratio of two values is meaningful
- variables such as distance, height, weight, and time use the ratio scale
- this scale must contain a zero value that indicates that nothing exists for the variable at the zero point

Types of data
According to source
1. Primary data - information which is gathered directly from the original source
2. Secondary data - information which is taken from a secondary source
According to function
1. Qualitative/Categorical data
- consist of attributes, labels, or nonnumerical entries
- use either the nominal or ordinal scale of measurement
- can be either numeric or nonnumeric
- appropriate statistical analysis is rather limited
2. Quantitative data
- Indicate how many or how much
- Always numeric
- Ordinary arithmetic operations are meaningful for quantitative data
a. Discrete data - if measuring how many
b. Continuous data - if measuring how much

As to format
1. Structured
- Reside in a pre-defined, row-column format
- Spreadsheet or database applications
- Enter, store, query, and analyze
- Numerical information that is objective and not open to interpretation
- Historically, companies relied mostly on structured data
- High cost to store and process
- Performance limitations

2. Unstructured
- Do not conform to a pre-defined, row-column format
- Textual
- Multimedia content
- Do not conform to database structures

3. Human or machine-generated
- Structured human: price, income, retail sales
- Structured machine: sensors, speed cameras, we

Business value and business process
1. Business value - refers to all the items, events, and interactions that determine a company's financial health
2. Business process - a coordinated, standardized set of activities conducted by both people and equipment to accomplish a specific business task

Companies perform thousands, and sometimes millions, of processes each day to create business value
Examples include
- DoorDash pays its drivers, who are independent contractors, for making deliveries
- Tesla produces an electric vehicle to sell in China
- Procter & Gamble addresses customer complaints, sometimes by giving refunds

Business analyst
- is a data specialist who curates and uses data to help an organization make effective business decisions

Overview of Business Analytics
a. Descriptive Analytics - what has happened?
- Business intelligence
- Gather, organize, tabulate, visualize, summarize
- Descriptive information can be presented in a number of formats (written reports, tables, graphs, maps)
- Access and manipulate data through reports, dashboards, applications, and visualization tools
- Uses past data integrated from multiple sources
- Inform decision-making and identify problems and solutions

b. Diagnostic Analytics - why did it happen?

c. Predictive Analytics - what could happen in the future?
- Use historical data to make predictions
- Analytical models help identify associations
- Associations are used to estimate the likelihood of a favorable outcome
- Commonly considered advanced predictions
- Build models that help an organization understand what might happen in the future
- Use statistics and data mining
d. Prescriptive Analytics - what should we do? (possible solutions)
- Optimization and simulation algorithms to provide advice
- Explore several possible actions
- Suggest a course of action
- Commonly considered advanced predictions
- Build models that help an organization understand what might happen in the future
- Use statistics and data mining
e. Adopting Analytics - evaluation
Frequency and Distribution
A. Summarizing Data for a Categorical Variable

1. Frequency Distribution
- Is a tabular summary of data showing the number (frequency) of observations in each of several non-overlapping categories or classes
- The objective is to provide insights about the data that cannot be quickly obtained by looking only at the original data
- Also called the frequency table
- Basis for creating many graphical displays. It is the initial summary of the raw data in which the data have been grouped for easier interpretation

Example
Soft drink purchasers were asked to select one among five popular soft drinks: Coca-Cola, Diet Coke, Dr. Pepper, Pepsi, and Sprite
Soft drinks selected by a sample of 20 purchasers are:

Soft drink    Frequency   Relative Frequency   Percent Frequency   Cumulative Frequency
Coca-Cola         2             0.10                 10%                   2
Diet Coke         3             0.15                 15%                   5
Pepsi             9             0.45                 45%                  14
Dr. Pepper        5             0.25                 25%                  19
Sprite            1             0.05                  5%                  20
Total            20             1.00                100%

2. Relative Frequency Distribution
- The relative frequency of a class is the fraction or proportion of the total number of data items belonging to a class
- Is a tabular summary of a set of data showing the relative frequency for each class

3. Percent Frequency Distribution
- The percent frequency of a class is the relative frequency multiplied by 100
- Is a tabular summary of a set of data showing the percent frequency for each class

4. Bar chart
- Is a graphical display for depicting categorical data
- On one axis (usually the horizontal axis), we specify the labels that are used for each of the classes
- A frequency, relative frequency, or percent frequency scale can be used for the other axis (usually the vertical axis)
- Using a bar of fixed width drawn above each class label, we extend the height appropriately
- The bars are separated to emphasize the fact that each class is separate

5. Pie chart
- Is a commonly used graphical display for presenting the relative frequency for each class
- First draw a circle; then use the relative frequency to subdivide the circle into sectors that correspond to the relative frequency for each class
- Since there are 360 degrees in a circle, a class with a relative frequency of 0.25 would consume 0.25(360) = 90 degrees of the circle

B. Summarizing Data for a Quantitative Variable
1. Frequency distribution
1.1 Elements of a frequency distribution table
a. Number of classes
- Classes are formed by specifying ranges that will be used to group the data
- As a general guideline, we recommend using between 5 and 20 classes
- Approximate class width = (largest data value - lowest data value) / number of classes
- Example with 5 classes: 33 - 12 = 21, and 21 / 5 = 4.2, which is always rounded up (here to 5)

b. Class limits
- Class limits must be chosen so that each data item belongs to one and only one class
- The lower class limits identify the smallest possible data value assigned to the class
- The upper class limits identify the largest possible data value assigned to the class
- The appropriate values for the class limits depend on the level of accuracy of the data
- An open-end class requires only a lower class limit or an upper class limit

c. Class boundaries
- the endpoints of an open interval which contains the class interval
- the lower class boundary (LCB) is the lower class limit (LCL) minus one-half the tolerance (-0.5)
- the upper class boundary (UCB) is the upper class limit (UCL) plus one-half the tolerance (+0.5)

d. Class size (class width)
- It refers to the difference between the upper class boundary and the lower class boundary
- Example: class boundaries = 11.5 to 33.5, so class size = (33.5 - 11.5) / 5 = 22 / 5 = 4.4
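The class-width rule, class limits, and class boundaries for a quantitative frequency distribution can be sketched in a few lines of Python. This is only an illustration: the 20 data values are hypothetical (chosen so the smallest is 12 and the largest is 33, as in the notes), the first lower class limit of 10 is an assumed choice, and the function names are made up for this sketch. It assumes integer data, so the tolerance is 0.5.

```python
import math

def class_width(largest, smallest, n_classes):
    """Approximate class width, always rounded up to a whole number."""
    return math.ceil((largest - smallest) / n_classes)

def grouped_frequency_table(data, first_lower_limit, width, n_classes):
    """Build one row per class for integer data."""
    rows = []
    cumulative = 0
    n = len(data)
    for k in range(n_classes):
        lcl = first_lower_limit + k * width   # lower class limit
        ucl = lcl + width - 1                 # upper class limit (integer data)
        freq = sum(1 for x in data if lcl <= x <= ucl)
        cumulative += freq
        rows.append({
            "limits": (lcl, ucl),
            "boundaries": (lcl - 0.5, ucl + 0.5),  # limits -/+ half the tolerance
            "frequency": freq,
            "relative": freq / n,
            "percent": 100 * freq / n,
            "cumulative": cumulative,
        })
    return rows

# Hypothetical sample: 20 integer values between 12 and 33.
data = [12, 13, 15, 16, 18, 18, 19, 21, 22, 22,
        23, 24, 26, 27, 28, 29, 30, 31, 33, 17]

width = class_width(max(data), min(data), 5)   # (33 - 12) / 5 = 4.2, rounded up
table = grouped_frequency_table(data, first_lower_limit=10, width=width, n_classes=5)
for row in table:
    print(row)
```

Starting the first class at 10 rather than 12 simply gives round class limits (10-14, 15-19, and so on); any starting point that covers the smallest value would work.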
e. Class mark (midpoint)
- the average of the upper limit and the lower limit of a class in a frequency distribution
- Value halfway between the lower and upper class limits
- Class mark = (upper limit + lower limit) / 2

2. Dot plot
- One of the simplest graphical summaries of data is a dot plot
- A horizontal axis shows the range of data values
- Then each data value is represented by a dot placed above the axis

3. Histogram
- The variable of interest is placed on the horizontal axis
- A rectangle is drawn above each class interval with its height corresponding to the interval's frequency, relative frequency, or percent frequency
- Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes

3.1 Histogram showing skewness
a. Symmetric
- Left tail is the mirror image of the right tail
- Example: heights of people
b. Moderately skewed left
- A longer tail to the left
- Example: exam scores
c. Moderately skewed right
- A longer tail to the right
- Example: housing values
d. Highly skewed right
- A very long tail to the right
- Example: executive salaries

4. Cumulative distributions
a. Cumulative frequency distribution
- shows the number of items with values less than or equal to the upper limit of each class
- the last entry always equals the total number of observations
b. Cumulative relative frequency distribution
- shows the proportion of items with values less than or equal to the upper limit of each class
- the last entry is always equal to 1.00
c. Cumulative percent frequency distribution
- shows the percentage of items with values less than or equal to the upper limit of each class
- the last entry is always equal to 100

5. Stem and leaf display
- Shows both the rank order and shape of the distribution of the data
- It is similar to a histogram on its side, but it has the advantage of showing the actual data values
- The first digits of each data item are arranged to the left of a vertical line
- To the right of the vertical line, we record the last digit for each item in rank order
- Stem - Each line (row) in the display
- Leaf - Each digit on a stem

a. Stretched stem and leaf display
- If we believe the original stem and leaf display has condensed the data too much, we can stretch the display vertically by using two stems for each leading digit
- Whenever a stem value is stated twice, the first value corresponds to leaf values of 0 to 4, and the second value corresponds to leaf values of 5 to 9

Leaf units
- A single digit is used to define each leaf
- In the preceding example, the leaf unit was 1
- Leaf units may be 100, 10, 1, 0.1, and so on
- Where the leaf unit is not shown, it is assumed to equal 1
- The leaf unit indicates how to multiply the stem and leaf numbers in order to approximate the original data
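The stem and leaf construction and the role of the leaf unit can be sketched as follows. This is a minimal illustration, assuming non-negative integer data and a whole-number leaf unit (1, 10, 100); the function name and the sample values are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return the display as a list of text lines, one line per stem."""
    stems = defaultdict(list)
    for v in sorted(values):
        scaled = v // leaf_unit          # drop the digits below the leaf unit
        stem, leaf = divmod(scaled, 10)  # last remaining digit is the leaf
        stems[stem].append(leaf)
    lines = []
    # Include empty stems so the shape of the distribution stays visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(d) for d in stems.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}".rstrip())
    return lines

# Hypothetical data with leaf unit 1.
scores = [68, 72, 75, 81, 83, 83, 90, 95, 102, 106]
for line in stem_and_leaf(scores):
    print(line)

# Hypothetical data with leaf unit 10: 1565 is shown as stem 15, leaf 6,
# which approximates the original value as 156 x 10 = 1560.
for line in stem_and_leaf([1565, 1852, 1644], leaf_unit=10):
    print(line)
```

Because the leaves are appended from sorted data, each row is already in rank order, which is what lets the display show both order and shape.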
C. Summarizing data for two variables using tables
1. Crosstabulation
- Is a tabular summary of data for two variables
- Crosstabulation can be used when
~ one variable is categorical and the other is quantitative,
~ both variables are categorical, or
~ both variables are quantitative
- The left and top margin labels define the classes for the two variables

Converting the entries in the table into row percentages or column percentages can provide additional insight about the relationship between the two variables

a. Simpson's Paradox
- Data in two or more crosstabulations are often aggregated to produce a summary crosstabulation
- We must be careful in drawing conclusions about the relationship between the two variables in the aggregated crosstabulation
- In some cases the conclusion based upon an aggregated crosstabulation can be completely reversed if we look at the unaggregated data. This reversal of conclusions based on aggregated and unaggregated data is called Simpson's paradox

D. Summarizing data for two variables using graphical displays
1. Scatter diagrams and trendlines
- Is a graphical presentation of the relationship between two quantitative variables
~ one variable is shown on the horizontal axis and the other variable is shown on the vertical axis
~ the general pattern of the plotted points suggests the overall relationship between the variables
~ a trendline provides an approximation of the relationship
- Useful in exploring the relationship between two variables
- Patterns a scatter diagram can show:
a. Positive relationship
b. Negative relationship
c. No apparent relationship

2. Side by side bar chart
- Is a graphical display for depicting multiple bar charts on the same display
- Each cluster of bars represents one value of the first variable
- Each bar within a cluster represents one value of the second variable

3. Stacked bar chart
- Is another way to display and compare two variables on the same display
- It is a bar chart in which each bar is broken into rectangular segments of a different color
- If percentage frequencies are displayed, all bars will be of the same height (or length), extending to the 100% mark
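Simpson's paradox, described under crosstabulation in this section, is easier to see with numbers. The sketch below uses entirely made-up counts (two options, A and B, each observed in two groups) to show a conclusion that reverses when the unaggregated crosstabulations are combined; none of these figures come from the notes.

```python
# Hypothetical counts: (successes, total) for each option within each group.
counts = {
    ("A", "group 1"): (18, 20),
    ("A", "group 2"): (24, 80),
    ("B", "group 1"): (64, 80),
    ("B", "group 2"): (4, 20),
}

# Unaggregated view: success rate of each option within each group.
by_group = {key: s / t for key, (s, t) in counts.items()}

# Aggregated view: add the groups together for each option.
totals = {}
for (option, _group), (s, t) in counts.items():
    agg_s, agg_t = totals.get(option, (0, 0))
    totals[option] = (agg_s + s, agg_t + t)
overall = {option: s / t for option, (s, t) in totals.items()}

for key, rate in by_group.items():
    print(key, f"{rate:.0%}")
for option, rate in overall.items():
    print(option, "overall", f"{rate:.0%}")
```

In this invented data, A has the higher rate inside both groups, yet B has the higher rate overall, because A's observations fall mostly in the group with low rates. That is the reversal the notes warn about.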
Data visualization: best practices in creating effective graphical displays

Data visualization
- the use of graphical displays to summarize and present information about a data set
- The goal is to communicate as effectively and clearly as possible the key information about the data
1. Creating effective graphical displays
- Creating effective graphical displays is as much art as it is science
- Guidelines:
~ give the display a clear and concise title
~ keep the display simple
~ clearly label each axis and provide the units of measure
~ if colors are used, make sure they are distinct
~ if multiple colors or lines are used, provide a legend
Tabular and Graphical Displays
2. Choosing the type of graphical display
2.1 displays used to show the distribution of data:
a. Bar chart - to show the frequency or percent frequency for categorical data
b. Pie chart - to show the relative frequency or percent frequency for categorical data
c. Dot plot - to show the distribution for quantitative data over the entire range of the data
d. Histogram - to show the frequency distribution for quantitative data over a set of class intervals
e. Stem-and-leaf display - to show both the rank order and shape of the distribution for quantitative data
2.2 displays used to make comparisons:
a. Side-by-side bar chart - to compare two variables
b. Stacked bar chart - to compare the relative frequency or percent frequency of two categorical variables
2.3 displays used to show relationships:
a. Scatter diagram - to show the relationship between two quantitative variables
b. Trendline - to approximate the relationship of data in a scatter diagram
3. Dashboards
- Is a widely used data visualization tool
- It organizes and presents key performance indicators (KPIs) used to monitor an organization or process
- It provides timely, summary information that is easy to read, understand, and interpret
- Guidelines:
~ minimize the need for screen scrolling
~ avoid unnecessary use of color or 3D
~ use borders between charts to improve readability
Descriptive Analytics

Numerical Measures
a. Sample statistics - If the measures are computed for data from a sample, they are called sample statistics
b. Population parameters - If the measures are computed for data from a population, they are called population parameters
c. A sample statistic is referred to as the point estimator of the corresponding population parameter
1. Measures of Location
a. Mean
= AVERAGE (data cell range)
- the average of a set of n numbers; it is the sum of all the numbers divided by n
- provides a measure of central location
- most important measure of location
- the sample mean $\bar{x}$ is the point estimator of the population mean $\mu$

$\bar{x} = \frac{\sum x_i}{n}$

$\sum x_i$ = sum of the values of the n observations
n = number of observations in the sample

b. Median
= MEDIAN (data cell range)
- middle score for a set of data that has been arranged in order of magnitude (ascending order)
- is the measure of location most often reported for annual income and property value data
- whenever a data set has extreme values, the median is the preferred measure of central location

Median = the value at position $\frac{n+1}{2}$ in the ordered data

Trimmed mean
- Another measure sometimes used when extreme values are present
- It is obtained by deleting a percentage of the smallest and largest values from the data set and then computing the mean of the remaining values
- For example, the 5% trimmed mean is obtained by removing the smallest 5% and the largest 5% of the data values and then computing the mean of the remaining values

c. Mode
= MODE (data cell range)
- the measure or value which occurs most frequently in a set of data; the value with the greatest frequency
~ the greatest frequency can occur at two or more different values
~ bimodal - if the data have exactly two modes
~ multimodal - if the data have more than two modes

d. Weighted mean
- In some instances, the mean is computed by giving each observation a weight that reflects its relative importance
- The choice of weights depends on the application
- The weights might be the number of credit hours earned for each grade, as in GPA
- In other weighted mean computations, quantities such as pounds, dollars, or volume are frequently used

$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$

$x_i$ = value of observation i
$w_i$ = weight of observation i

Numerator: sum of the weighted data values
Denominator: sum of the weights
~ if the data are from a population, $\mu$ replaces $\bar{x}$

Purchase   Cost per pound (xi)   Number of pounds (wi)   wi xi
1               3.00                  1200                3600
2               3.40                   500                1700
3               2.80                  2750                7700
4               2.90                  1000                2900
5               3.25                   800                2600
Total                                 6250               18500

e. Geometric mean
= GEOMEAN (data cell range)
- Is calculated by finding the nth root of the product of n values
- It is often used in analyzing growth rates in financial data (where using the arithmetic mean will provide misleading results)
- It should be applied anytime you want to determine the mean rate of change over several successive periods (years, quarters, weeks)
- Other common applications include changes in populations of species, crop yields, pollution levels, and birth and death rates
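The weighted mean, geometric mean, and trimmed mean from this section can be sketched in Python as below. The weighted-mean inputs are the cost-per-pound and pounds figures from the purchase table in the notes; the growth factors and the data set with one extreme value are hypothetical, and the function names are made up for this sketch. The rule for how many values a trimmed mean drops when the percentage does not divide evenly is an assumption here (it rounds down).

```python
import math
import statistics

def weighted_mean(values, weights):
    """Sum of weighted data values divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def geometric_mean(factors):
    """nth root of the product of n values (all values must be positive)."""
    return math.prod(factors) ** (1 / len(factors))

def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)   # assumed rule: round down
    return statistics.mean(ordered[k:len(ordered) - k])

# Purchase table from the notes: cost per pound (x_i) and pounds bought (w_i).
cost_per_pound = [3.00, 3.40, 2.80, 2.90, 3.25]
pounds = [1200, 500, 2750, 1000, 800]
print(weighted_mean(cost_per_pound, pounds))   # 18500 / 6250, about 2.96

# Hypothetical growth factors for three successive periods (+10%, -5%, +20%).
print(geometric_mean([1.10, 0.95, 1.20]))

# Hypothetical data set with one extreme value; a 5% trim of 20 values
# removes the single smallest and single largest value.
values = list(range(1, 20)) + [1000]
print(trimmed_mean(values, 5))
```

Growth rates are entered as factors (1.10 for a 10% increase) so that their product is the total growth over all periods; the geometric mean is then the constant per-period factor that would give the same total.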
f. Percentiles
= PERCENTILE.EXC (data range, p/100)
- Provides information about how the data are spread over the interval from the smallest value to the largest value
- Admission test scores for colleges and universities are frequently reported in terms of percentiles
- The pth percentile of a data set is a value such that at least p% of the items take on this value or less and at least (100-p)% of the items take on this value or more
- Arrange the data in ascending order, then compute the location of the pth percentile:

$L_p = \frac{p}{100}(n + 1)$

g. Quartiles
= QUARTILE.EXC (array, QUART)
- These are specific percentiles
First quartile = 25th percentile
Second quartile = 50th percentile = median
Third quartile = 75th percentile

2. Measures of Variability
- It is often desirable to consider measures of variability (dispersion), as well as measures of location
- For example, in choosing supplier A or supplier B we might consider not only the average delivery time for each but also the variability in delivery time for each
a. Range
= Largest value - Smallest value
- The range of a data set is the difference between the largest and smallest data values
- It is the simplest measure of variability
- It is very sensitive to the smallest and largest data values
b. Interquartile Range
= Q3 - Q1
- The interquartile range of a data set is the difference between the third quartile and the first quartile
- It is the range for the middle 50% of the data
- It overcomes the sensitivity to extreme data values

c. Variance
= VAR.S (data cell range)
- The variance is a measure of variability that utilizes all the data
- It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population)
- The variance is useful in comparing the variability of two or more variables
- Is the average of the squared differences between each data value and the mean

d. Standard deviation
= STDEV.S (data cell range)
- The standard deviation of a data set is the positive square root of the variance
- It is measured in the same units as the data, making it more easily interpreted than the variance

e. Coefficient of variation
- It indicates how large the standard deviation is in relation to the mean

3. Measures of Distribution Shape, Relative Location, and Detecting Outliers
a. Skewness
- An important measure of the shape of a distribution
- Can be easily computed using statistical software

a.1 Symmetric (not skewed)
~ Skewness is zero
~ Mean and median are equal
a.2 Moderately Skewed Left
~ Skewness is negative
~ Mean will usually be less than the median
a.3 Moderately Skewed Right
~ Skewness is positive
~ Mean will usually be more than the median
a.4 Highly Skewed Right
~ Skewness is positive (often above 1.0)
~ Mean will usually be more than the median
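The percentile location formula and the variability measures in this section (range, interquartile range, variance, standard deviation, coefficient of variation) can be worked through with a short Python sketch. Several things here are assumptions rather than statements from the notes: the nine data values are hypothetical, a fractional location is resolved by linear interpolation between the two neighbouring ordered values, locations outside 1..n are clamped to the smallest or largest value, the sample variance divides by n - 1 (as Excel's VAR.S does), and the coefficient of variation is expressed as a percentage.

```python
import math

def percentile(data, p):
    """pth percentile using the location L_p = (p / 100) * (n + 1)."""
    ordered = sorted(data)
    n = len(ordered)
    location = (p / 100) * (n + 1)      # 1-based position in the ordered data
    lower_pos = math.floor(location)
    fraction = location - lower_pos
    if lower_pos < 1:                   # assumed: clamp below the first value
        return ordered[0]
    if lower_pos >= n:                  # assumed: clamp above the last value
        return ordered[-1]
    lower, upper = ordered[lower_pos - 1], ordered[lower_pos]
    return lower + fraction * (upper - lower)   # assumed: linear interpolation

def sample_variance(data):
    """Squared deviations from the mean, summed and divided by n - 1."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / (len(data) - 1)

# Hypothetical sample of nine observations.
data = [3, 5, 7, 8, 12, 13, 14, 18, 21]

q1, median, q3 = (percentile(data, p) for p in (25, 50, 75))
data_range = max(data) - min(data)
iqr = q3 - q1
variance = sample_variance(data)
std_dev = math.sqrt(variance)
coeff_var = std_dev / (sum(data) / len(data)) * 100   # percent of the mean

print(q1, median, q3, data_range, iqr)
print(variance, std_dev, coeff_var)
```

For this sample the first quartile sits at location 2.5, halfway between the second and third ordered values, which is why the interpolation step matters.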
b. Z-scores
- Often called the standardized value
- It denotes the number of standard deviations a data value $x_i$ is from the mean
- Excel's STANDARDIZE function can be used to compute the z-score
- An observation's z-score is a measure of the relative location of the observation in a data set
- A data value less than the sample mean will have a z-score less than zero
- A data value greater than the sample mean will have a z-score greater than zero
- A data value equal to the sample mean will have a z-score of zero
c. Chebyshev's Theorem
- At least $(1 - 1/z^2)$ of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1
- Chebyshev's theorem requires z > 1, but z need not be an integer
- At least 75% of the data values must be within z = 2 standard deviations of the mean
- At least 89% of the data values must be within z = 3 standard deviations of the mean
- At least 94% of the data values must be within z = 4 standard deviations of the mean

d. Empirical rule
When the data are believed to approximate a bell-shaped distribution:
- The empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean
- The empirical rule is based on the normal distribution
For data having a bell-shaped distribution:
- Approximately 68% of the data values will be within +/-1 standard deviation of the mean
- Approximately 95% of the data values will be within +/-2 standard deviations of the mean
- Almost all (approximately 99.7%) of the data values will be within +/-3 standard deviations of the mean

e. Detecting outliers
- An outlier is an unusually small or unusually large value in a data set
- A data value with a z-score less than -3 or greater than +3 might be considered an outlier
- It might be:
~ an incorrectly recorded data value
~ a data value that was incorrectly included in the data set
~ a correctly recorded unusual data value that belongs in the data set

4. Five-number summaries and boxplots
- Summary statistics and easy-to-draw graphs can be used to quickly summarize large quantities of data
- Two tools that accomplish this are five-number summaries and boxplots

a. Five Number Summary
- Smallest value
- First quartile
- Median
- Third quartile
- Largest value

b. Boxplots and Outliers
- Limits are located using the interquartile range (IQR)
- Data outside these limits are considered outliers
- The location of each outlier is shown with a symbol
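The z-score, Chebyshev bound, and z-score outlier check described in this part of the notes can be sketched as follows. The data set is hypothetical (nineteen values near 10 plus one large value), the function names are made up for illustration, and the z-scores use the sample mean and sample standard deviation.

```python
import statistics

def z_scores(data):
    """Number of sample standard deviations each value lies from the mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return [(x - mean) / s for x in data]

def chebyshev_lower_bound(z):
    """Minimum proportion of values within z standard deviations (z > 1)."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2

def z_score_outliers(data, threshold=3.0):
    """Values whose z-score is below -threshold or above +threshold."""
    return [x for x, z in zip(data, z_scores(data)) if abs(z) > threshold]

# Hypothetical data: nineteen values near 10 and one unusually large value.
data = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
        10, 11, 9, 10, 12, 8, 10, 11, 9, 60]

print([round(z, 2) for z in z_scores(data)])
print(z_score_outliers(data))
for z in (2, 3, 4):
    print(z, chebyshev_lower_bound(z))   # 0.75, about 0.89, about 0.94
```

Whether a flagged value should be removed still depends on the three possibilities listed in the notes: a recording error, a value that does not belong, or a genuine unusual observation.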
Boxplot
- Is a graphical summary of data that is based
on a five-number summary
- A key to the development of a boxplot is the
computation of the median and the
quartiles, Q1 and Q3
- Boxplots provide another way to identify
outliers
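A five-number summary and boxplot limits can be computed with the sketch below. The notes say only that the limits are located using the IQR; the multiplier of 1.5 used here is the common convention and is an assumption, as are the hypothetical data, the function names, and the choice to compute quartiles from the location $(p/100)(n+1)$ with linear interpolation.

```python
import math

def percentile(data, p):
    """pth percentile at location (p / 100) * (n + 1), with interpolation."""
    ordered = sorted(data)
    n = len(ordered)
    location = (p / 100) * (n + 1)
    lower_pos = min(max(math.floor(location), 1), n)   # clamp to 1..n
    upper_pos = min(lower_pos + 1, n)
    fraction = min(max(location - lower_pos, 0.0), 1.0)
    lower, upper = ordered[lower_pos - 1], ordered[upper_pos - 1]
    return lower + fraction * (upper - lower)

def five_number_summary(data):
    """Smallest value, Q1, median, Q3, largest value."""
    return (min(data), percentile(data, 25), percentile(data, 50),
            percentile(data, 75), max(data))

def boxplot_limits(q1, q3, k=1.5):
    """Lower and upper limits at k * IQR beyond the quartiles (k assumed)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample of ten observations.
data = [3, 5, 7, 8, 12, 13, 14, 18, 21, 45]

summary = five_number_summary(data)
lower_limit, upper_limit = boxplot_limits(summary[1], summary[3])
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print(summary)
print(lower_limit, upper_limit, outliers)
```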
5. Measures of Association Between Two
Variables
- Thus far we have examined numerical
methods used to summarize the data for
one variable at a time
- Often a manager or decision maker is
interested in the relationship between two
variables
- Two descriptive measures of the
relationship between two variables are
covariance and correlation coefficient
a. Covariance
= COVARIANCE.S (b data range, c data range)
- Is a measure of the linear association
between two variables
- Positive values = positive relationship
- Negative values = negative relationship
b. Correlation coefficient
= CORREL (b data range,c data range)
- Correlation is a measure of linear
association and not necessarily causation
- Just because two variables are highly
correlated, it does not mean that one
variable is the cause of the other
- The coefficient can take on values between
-1 and +1
- Values near -1 indicate a strong negative
linear relationship
- Values near +1 indicate a strong positive
linear relationship
- The closer the correlation is to zero, the
weaker the relationship
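Sample covariance and the correlation coefficient can be computed by hand as in the sketch below, which mirrors what Excel's COVARIANCE.S and CORREL return. The paired data are hypothetical and the function names are made up for illustration; the covariance divides by n - 1 because it is a sample measure.

```python
import math

def sample_covariance(x, y):
    """Sum of products of paired deviations from the means, over n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))   # covariance of x with itself
    s_y = math.sqrt(sample_covariance(y, y))   # is the sample variance
    return sample_covariance(x, y) / (s_x * s_y)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

print(sample_covariance(x, y))   # positive, so a positive relationship
print(correlation(x, y))         # between -1 and +1
```

Unlike the covariance, the correlation coefficient does not change when either variable is rescaled (for example from pesos to thousands of pesos), which is why it is the easier of the two to interpret.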