Statistics (S1101)
By
Abbas MOURAD
S1100
❑ Chapter 1 : Basic Statistical Concepts
➢ Some definitions
❑ Chapter 2 : Univariate Statistical Analysis
➢ Graphical representation of data : Bar chart, Strip chart, Histogram, Boxplot.
➢ Numerical representation of data: Mean, Median, Mode, Variance, Quantiles,
Skewness and Kurtosis
❑ Chapter 3: Two-dimensional Descriptive Statistics
➢ Data Representation, Coefficient of correlation and simple linear regression
❑ Chapter 4: Combinatorial analysis & Probability
➢ Permutations, Arrangements and Combinations
➢ Basic Concepts of Probability, Conditional probability and Bayes' Theorem
Chapter 1:
Basic Statistical Concepts
What is Statistics?
Why? Data analysis supports decision-making. It involves:
1. Collecting Data (e.g., Survey)
2. Presenting Data (e.g., Charts & Tables)
3. Characterizing Data (e.g., Mean, Median)
What is Statistics?
❑ Statistics: The science of collecting, describing, and
interpreting data.
Two areas of statistics:
Statistical Methods
Descriptive Statistics
Inferential Statistics
➢ Descriptive Statistics: collection, presentation, and
description of sample data.
➢ Inferential Statistics: making decisions and drawing
conclusions about populations.
Descriptive Statistics
1. Involves
• Collecting Data
• Presenting Data
• Characterizing Data
2. Purpose
• Describe Data
[Figure: bar chart of amounts in $ for quarters Q1 to Q4, annotated Mean = 30.5, S² = 113]
Inferential Statistics
1. Involves
• Estimation
• Hypothesis Testing
2. Purpose
• Make decisions about population characteristics
Population & Sample
[Figure: a population of size N from which four samples (Sample 1 to Sample 4), each of size n, are drawn]
❑ Understand that statistics can be used to gain information about a
population by examining a sample of the population ; generalizations
about a population from a sample are valid only if the sample is
representative of that population.
Simple Random Sample
Every subset of a specified size n from the
population has an equal chance of being selected
Types of Data
Two types of data: Quantitative Data and Qualitative Data
Quantitative data are measurements that are
recorded on a naturally occurring numerical scale.
Qualitative data are measurements that cannot be
measured on a natural numerical scale; they can only
be classified into one of a group of categories.
Types of variables
❑ Example: The table below represents employees of a
communications company.
Employee   Age   Gender   Number of children   Level of satisfaction
1          25    M        0                    Satisfied
2          35    M        2                    Very Satisfied
3          24    F        0                    Satisfied
4          52    M        3                    Unsatisfied
….         ….    ….       ….                   ….
99         48    F        2                    Very Satisfied
100        26    F        1                    Satisfied
▪ Population: Employees of a communications company
▪ Sample: 100 employees of a communications company
▪ Individual: 1 employee
▪ Variables: Age, Gender, Number of children and Level of satisfaction
Descriptive Statistics
❑ Two kinds of variables:
➢ Qualitative, or Categorical, Variable: A variable that categorizes
or describes an element of a population.
➢ Quantitative, or Numerical, Variable: A variable that quantifies
an element of a population.
Variable
  Qualitative
    Nominal (Gender with 2 levels: M & F)
    Ordinal (Level of satisfaction with 5 levels)
  Quantitative
    Discrete (Number of children)
    Continuous (Age)
Descriptive Statistics
Nominal Variable: A qualitative variable that categorizes (or
describes, or names) an element of a population.
Ordinal Variable: A qualitative variable that incorporates an
ordered position, or ranking.
Discrete Variable: A quantitative variable that can assume a
countable number of values. Intuitively, a discrete variable can
assume values corresponding to isolated points along a line
interval. That is, there is a gap between any two values.
Continuous Variable: A quantitative variable that can assume
an uncountable number of values. Intuitively, a continuous
variable can assume any value along a line interval, including
every possible value between any two values
Chapter 2:
Univariate Statistical Analysis
Basic Notations
Xi        ni       fi            pi                Ni (increasing)   Ni (decreasing)
[0-10[    2        2/10 = 0.2    0.2×100 = 20%     2                 8+2 = 10
[10-20[   4        0.4           40%               2+4 = 6           4+4 = 8
[20-30[   3        0.3           30%               6+3 = 9           1+3 = 4
[30-40[   1        0.1           10%               9+1 = 10          1
Total     n = 10   1             100%              -------           -------
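The columns of the notation table can be recomputed mechanically from the frequencies ni. The following is a minimal Python sketch; the variable names (`pct`, `Ni_inc`, `Ni_dec`) are chosen here for illustration and are not part of the course notation.

```python
from itertools import accumulate

# Class labels and frequencies (ni) taken from the Basic Notations table.
classes = ["[0-10[", "[10-20[", "[20-30[", "[30-40["]
ni = [2, 4, 3, 1]

n = sum(ni)                                  # total number of observations
fi = [c / n for c in ni]                     # relative frequencies fi = ni / n
pct = [100 * f for f in fi]                  # percentages pi = fi x 100
Ni_inc = list(accumulate(ni))                # increasing cumulative frequencies
Ni_dec = list(accumulate(ni[::-1]))[::-1]    # decreasing cumulative frequencies

for row in zip(classes, ni, fi, pct, Ni_inc, Ni_dec):
    print(*row)
```

The decreasing cumulative column is obtained by accumulating the frequencies from the last class backwards, which is what the table does by hand (1, 1+3, 4+4, 8+2).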
Descriptive Statistics
Descriptive Statistics are Used by Researchers to
Report on Populations and Samples
Types of descriptive statistics:
✓ Organize Data (Qualitative Variable)
▪ Tables
▪ Graphs
✓ Summarize Data (Quantitative Variable)
▪ Central Tendency
▪ Variation
Representation of qualitative variable
▪ Measurement scale: Nominal
Example: Gender
❑ Male
❑ Female
✓ Tables: Frequency Distributions (Percentage)
✓ Graphs: Bar Chart or Pie Chart
✓ Measures of Central Tendency: Mode
▪ Measurement scale: Ordinal
Example: What is your level of satisfaction with public transport in Lebanon?
❑ Very Satisfied
❑ Satisfied
❑ Neither satisfied/unsatisfied
❑ Unsatisfied
❑ Very Unsatisfied
✓ Tables: Frequency Distributions (Percentage) and cumulative frequency
✓ Graphs: Bar Chart or Pie Chart
✓ Measures of Central Tendency: Mode and Median
Descriptive Statistics : Qualitative variable
Example : Variable Gender : Bar Chart and/or Pie Chart
Gender    ni        fi      pi      Angle = fi × 360°
Male      33        0.236   23.6%   84.96°
Female    107       0.764   76.4%   275.04°
Total     n = 140   1       100%    360°
[Figures: bar chart and pie chart of Gender]
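The pie chart angles follow directly from the counts. A small Python sketch, using the Gender counts from the table; the dictionary name `counts` is an illustrative choice.

```python
# Counts (ni) from the Gender table.
counts = {"Male": 33, "Female": 107}

n = sum(counts.values())
# Angle of each sector: fi x 360 degrees, with fi = ni / n.
angles = {label: 360 * c / n for label, c in counts.items()}

for label, angle in angles.items():
    print(label, round(angle, 2))
```

Computed from the unrounded frequencies, the Male sector is about 84.86°; the table's 84.96° comes from rounding fi to 0.236 before multiplying by 360°.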
Descriptive Statistics: Quantitative Discrete variable
✓ Quantitative Discrete Variable
✓ Numerical measures: Mode, median, mean, Variance, ….
✓ Graphs : Strip Chart
Number of persons   ni   fi     Ni
1                   5    0.1    5
2                   9    0.18   14
3                   15   0.3    29
4                   10   0.2    39
5                   6    0.12   45
6                   3    0.06   48
7                   0    0      48
8                   2    0.04   50
Total               50   1      …
Descriptive Statistics: Quantitative Discrete variable
✓ Quantitative Discrete Variable : Cumulative diagram or cumulative distribution function
[Figure: cumulative diagram (step function) of Ni for the number-of-persons table above]
Quantitative continuous Variable: Histogram & Ogive
Descriptive Statistics: Quantitative Cont. variable
Quantitative Continuous Variable
Numerical measures: Mode, median, mean, Variance, ….
Graphs : Histogram
• Same width : bar heights based on ni
• Different widths : bar heights based on the density di
Descriptive Statistics: Quantitative Cont. variable
Quantitative Continuous Variable : Histogram (Same width)
Classes    ni     Width: ai
[2-6[      60     6-2 = 4
[6-10[     80     10-6 = 4
[10-14[    130    14-10 = 4
[14-18[    100    18-14 = 4
[18-22[    30     22-18 = 4
Total      400    …..
[Figure: histogram with its frequency polygon]
Descriptive Statistics: Quantitative Cont. variable
Quantitative Continuous Variable : Histogram (Different widths)
Classes    ni     Width: ai     Density: di = ni/ai
[2-3[      60     3-2 = 1       60/1 = 60
[3-4[      80     4-3 = 1       80/1 = 80
[4-8[      130    8-4 = 4       130/4 = 32.5
[8-16[     100    16-8 = 8      100/8 = 12.5
[16-22[    30     22-16 = 6     30/6 = 5
Total      400    …..           …..
The frequency polygon in the case of different widths is based on
the greatest common divisor (gcd) of the ai :
In our example: gcd(ai) = 1
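The widths, densities and gcd of the different-width example can be checked with a few lines of Python. This is only a sketch; representing the classes by their list of bounds (`bounds`) is an assumption made here for convenience.

```python
import math
from functools import reduce

# Class bounds and frequencies (ni) of the different-width example.
bounds = [2, 3, 4, 8, 16, 22]
ni = [60, 80, 130, 100, 30]

# Width ai of each class [lo, hi[.
widths = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
# Density di = ni / ai, the bar height used when widths differ.
densities = [c / a for c, a in zip(ni, widths)]
# Greatest common divisor of the widths.
common_width = reduce(math.gcd, widths)

print(widths, densities, common_width)
```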
[Figure: histogram with different class widths and its frequency polygon]
Descriptive Statistics: Quantitative Cont. variable
Quantitative Continuous Variable : Cumulative Frequencies Curve (Ogive)
Classes    ni     Ni
[2-3[      60     60
[3-4[      80     140
[4-8[      130    270
[8-16[     100    370
[16-22[    30     400
Total      400    …..
The Numerical Characteristics
Measures of Central Tendency : The Mean
Measures of Central Tendency : The Mean
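Non grouped data: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Example: 2, 4, 5, 1, 3 => $\bar{x} = (2 + 4 + 5 + 1 + 3)/5 = 3$
Grouped data: The distribution is given as a frequency table and $\bar{x} = \frac{1}{n}\sum_{i=1}^{k} n_i x_i$
Example:
Xi      ni
2       1
3       5
4       2
5       2
Total   10
=> $\bar{x} = (2 \times 1 + 3 \times 5 + 4 \times 2 + 5 \times 2)/10 = 3.5$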
Measures of Central Tendency : The Mean
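Grouped data by classes : the data are grouped into classes, ci is the class midpoint, and the mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{k} n_i c_i$
Classes    ni     Center of class: ci
[2-6[      60     (2+6)/2 = 4
[6-10[     80     (6+10)/2 = 8
[10-14[    130    (10+14)/2 = 12
[14-18[    100    (14+18)/2 = 16
[18-22[    30     (18+22)/2 = 20
Total      400    …..
=> $\bar{x} = (60 \times 4 + 80 \times 8 + 130 \times 12 + 100 \times 16 + 30 \times 20)/400 = 4640/400 = 11.6$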
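The three cases of the mean reduce to one weighted average. Below is a minimal Python sketch; the function name `weighted_mean` is hypothetical, and ungrouped data are treated as values that each have frequency 1.

```python
def weighted_mean(values, counts=None):
    """Mean of values weighted by their frequencies (all 1 if counts is None)."""
    if counts is None:
        counts = [1] * len(values)
    return sum(x * c for x, c in zip(values, counts)) / sum(counts)

# Non grouped data.
print(weighted_mean([2, 4, 5, 1, 3]))

# Grouped data: values Xi with frequencies ni.
print(weighted_mean([2, 3, 4, 5], [1, 5, 2, 2]))

# Grouped data by classes: class midpoints ci with frequencies ni.
bounds = [2, 6, 10, 14, 18, 22]
midpoints = [(lo + hi) / 2 for lo, hi in zip(bounds, bounds[1:])]
print(weighted_mean(midpoints, [60, 80, 130, 100, 30]))
```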
The Numerical Characteristics
Measures of Central Tendency : The Mode
Measures of Central Tendency : The Mode
The mode is the most repeated value in a distribution
Non grouped data
Example 1: 2, 2, 4, 5, 8, 12, 2, 1, 2 => Mode = 2
Example 2: 2, 1, 4, 5, 8, 12, 2, 1, 1 => Mode = 1 (1 occurs three times and 2 only twice; without the last 1 there would be two modes, 1 and 2)
Example 3: 2, 4, 6, 8, 1, 10, 22, 45 => Mode doesn't exist
Grouped data
The most repeated value, i.e. the value with the highest frequency
Xi      ni
2       1
3       5
4       2
5       2
Total   10
=> Mode = 3
Measures of Central Tendency : The Mode
The mode is the most repeated value in a distribution
Grouped data : Graphically
[Figure: strip chart of ni (vertical axis, 0 to 6) against Xi (horizontal axis, 0 to 6) for the grouped-data table above; the tallest bar is at Xi = 3]
Mode = 3
Measures of Central Tendency : The Mode
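Grouped data by classes : In the case of grouped data by classes, the mode depends on the class
widths (same or different)
Step 1: find the modal class [L, L + a[
Step 2: interpolate inside the modal class
Same Width: $Mo = L + a \times \frac{\Delta_1}{\Delta_1 + \Delta_2}$, where $\Delta_1 = n_m - n_{m-1}$ and $\Delta_2 = n_m - n_{m+1}$ compare the frequency of the modal class with those of the classes before and after it
Different Widths: the same formula with the densities $d_i = n_i/a_i$ in place of the frequencies $n_i$
Example (Same Width)
Classes    ni     Width (ai)
[2-6[      60     4
[6-10[     80     4
[10-14[    130    4
[14-18[    100    4
[18-22[    30     4
Total      400
Step 1: Modal Class: The class with highest frequency => Modal class = [10-14[
Step 2 : Mo = 10 + 4 × (130 - 80)/((130 - 80) + (130 - 100)) = 10 + 4 × 50/80 = 12.5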
Measures of Central Tendency : The Mode
Example (Same Width) : Graphically
[Figure: histogram of the same-width classes with the graphical construction of the mode inside the modal class [10-14[]
Mode ≈ 12.5
Measures of Central Tendency : The Mode
Example (Different Widths)
Classes    ni     Width: ai     Density: di = ni/ai
[2-3[      60     3-2 = 1       60/1 = 60
[3-4[      80     4-3 = 1       80/1 = 80
[4-8[      130    8-4 = 4       130/4 = 32.5
[8-16[     100    16-8 = 8      100/8 = 12.5
[16-22[    30     22-16 = 6     30/6 = 5
Total      400    …..           …..
Step 1: Modal Class: The class with highest density => Modal class = [3-4[
Step 2 : Mo = 3 + 1 × (80 - 60)/((80 - 60) + (80 - 32.5)) = 3 + 20/67.5 ≈ 3.3
Graphically: Mode ≈ 3.3
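One way to reproduce the two modes reported for grouped classes (12.5 and 3.3) is the usual interpolation inside the modal class, Mo = L + a × Δ1/(Δ1 + Δ2), applied to the densities di = ni/ai. The Python sketch below assumes that formula; the function name `mode_by_classes` and the convention that a missing neighbouring class has height 0 are choices made here, not taken from the slides.

```python
def mode_by_classes(bounds, counts):
    """Interpolated mode for data grouped into classes [bounds[i], bounds[i+1][."""
    widths = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
    # Densities work for both cases: with equal widths they are
    # proportional to the frequencies, so the result is unchanged.
    heights = [c / a for c, a in zip(counts, widths)]
    m = max(range(len(heights)), key=heights.__getitem__)   # modal class index
    before = heights[m - 1] if m > 0 else 0.0
    after = heights[m + 1] if m < len(heights) - 1 else 0.0
    d1 = heights[m] - before
    d2 = heights[m] - after
    return bounds[m] + widths[m] * d1 / (d1 + d2)

counts = [60, 80, 130, 100, 30]
print(mode_by_classes([2, 6, 10, 14, 18, 22], counts))   # same width
print(mode_by_classes([2, 3, 4, 8, 16, 22], counts))     # different widths
```

With equal widths the modal class is [10-14[; with the different-width classes it moves to [3-4[, which illustrates how sensitive the mode is to the choice of classes.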
The Numerical Characteristics
Measures of Central Tendency : The Median
Measures of Central Tendency : The Median
The median is the middle value in the distribution when the numbers
have been arranged in increasing order
Case 1- Ungrouped data
Step 1 : Order observations from smallest to largest : x(1) ≤ x(2) ≤ … ≤ x(n)
Step 2 : The median is
Me = x(E(n/2) + 1) if n is odd, where E(n/2) is the integer part of n/2
Me = (x(n/2) + x(n/2 + 1))/2 if n is even
Example 1 : n is odd (n = 13)
1 2 3 2 3 6 3 4 4 5 4 4 3
Step 1 (ordering): 1 2 2 3 3 3 3 4 4 4 4 5 6
Step 2: n is odd => The integer part of 13/2 = 6.5 is 6 => Me = x(6+1) = x(7) = 3 (the 7th value of ordered data)
Example 2 : n is even (n = 8)
5 6 8 10 9 9 4 7
Step 1 (ordering): 4 5 6 7 8 9 9 10
Step 2: n is even => Me = (x(4) + x(5))/2 = (7 + 8)/2 = 7.5 (the median is the value between the 4th and 5th
observations of ordered data)
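The two-step procedure for ungrouped data translates directly into code. A minimal Python sketch follows; the function name `median_ungrouped` is hypothetical, and Python's 0-based indexing shifts every rank by one compared with x(1), …, x(n).

```python
def median_ungrouped(values):
    """Median of raw data following the odd/even rule for ordered observations."""
    xs = sorted(values)          # Step 1: order the observations
    n = len(xs)
    if n % 2 == 1:
        # n odd: x(E(n/2) + 1) in 1-based ranks is xs[n // 2] in 0-based indexing.
        return xs[n // 2]
    # n even: average of x(n/2) and x(n/2 + 1).
    return (xs[n // 2 - 1] + xs[n // 2]) / 2

print(median_ungrouped([1, 2, 3, 2, 3, 6, 3, 4, 4, 5, 4, 4, 3]))   # Example 1
print(median_ungrouped([5, 6, 8, 10, 9, 9, 4, 7]))                 # Example 2
```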
Measures of Central Tendency : The Median
Case 2- Grouped data
Step 1 : Calculate the increasing cumulative frequency : Ni
Step 2 : Calculate n/2 :
If n/2 equals one of the increasing cumulative frequencies, the median is the
average of the value of X associated with it and the next value of X
Otherwise, we take the value of X associated with the increasing
cumulative frequency immediately greater than n/2
Example 3 : Data from the table of the number of persons (n = 50)
Step 1: Calculate Ni
Step 2: n/2 = 50/2 = 25 =>
Me = (x(25) + x(26))/2 = (3 + 3)/2 = 3
Measures of Central Tendency : The Median
Median graphically – Example 3
[Figure: cumulative diagram of Ni (vertical axis from 10 to 50); the level n/2 = 50/2 = 25 is reached at Median = 3]
Measures of Central Tendency : The Median
Example 4 :
xi      ni    Ni
10      10    10
11      2     12
12      8     20
13      4     24
Total   24
Step 1: Calculate Ni
Step 2: n/2 = 24/2 = 12 =>
Me = (x(12) + x(13))/2 = (11 + 12)/2 = 11.5
[Figure: cumulative diagram of Ni against xi; the level n/2 = 24/2 = 12 gives the median 11.5, between 11 and 12]
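Examples 3 and 4 can be reproduced from the frequency table alone, without expanding it back into raw data. The Python sketch below is one possible reading of the rule, in which a cumulative frequency exactly equal to n/2 leads to the average of two consecutive observed values; the function name `median_grouped` is hypothetical.

```python
from itertools import accumulate

def median_grouped(xs, counts):
    """Median of a discrete frequency table (values xs sorted, frequencies counts)."""
    cum = list(accumulate(counts))       # increasing cumulative frequencies Ni
    half = cum[-1] / 2                   # n / 2
    for i, N in enumerate(cum):
        if N == half:
            # x(n/2) is xs[i]; x(n/2 + 1) is the next value actually observed.
            j = next(k for k in range(i + 1, len(xs)) if counts[k] > 0)
            return (xs[i] + xs[j]) / 2
        if N > half:
            return xs[i]

# Example 3: number of persons, n = 50.
print(median_grouped([1, 2, 3, 4, 5, 6, 7, 8], [5, 9, 15, 10, 6, 3, 0, 2]))
# Example 4: n = 24.
print(median_grouped([10, 11, 12, 13], [10, 2, 8, 4]))
```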
Measures of Central Tendency : The Median
Case 3- Grouped data by classes – Continuous Variable
Step 1 : Calculate the increasing cumulative frequency (Ni) and find the median class: calculate n/2 and take
the class whose increasing cumulative frequency is equal to n/2 or, otherwise, immediately
greater than n/2
Step 2 : Interpolate inside the median class [L, L + a[ : $Me = L + \frac{n/2 - N_{m-1}}{n_m} \times a$, where $n_m$ is the frequency of the median class and $N_{m-1}$ the increasing cumulative frequency of the class before it
Example 5 :
Classes    ni    Ni (increasing)   Ni (decreasing)
[10-20[    10    10                80
[20-30[    25    35                70
[30-40[    15    50                45
[40-50[    15    65                30
[50-60[    15    80                15
Total      80
Step 1: Find the median class
n/2 = 80/2 = 40 => Me ∊ [30-40[
Step 2 : Me = 30 + [(40-35)/15] × 10 ≈ 33.33
Measures of Central Tendency : The Median
Median graphically – Grouped data by classes
[Figure: cumulative frequency (Ni) curves with the median marked]
Measures of Central Tendency : Properties
Properties of the mean:
Extreme values, called outliers, affect the mean. (When outliers are present,
the mean is “pulled” in the direction of the outlier)
The mean obtained after grouping data into classes generally differs from the mean
obtained from ungrouped data
Properties of the median:
Unlike the mean, the median is not sensitive to extreme values (outliers)
Like the mean and the mode, the median obtained after grouping data
into classes generally differs from the median obtained from ungrouped data
Properties of the mode:
is not affected by extreme values
is very sensitive to variations in class width (amplitude)
may not be unique ; may not exist
Like the mean, the mode obtained after grouping data into classes generally differs
from the mode obtained from ungrouped data
The Numerical Characteristics
Measures of dispersion:
The range, the variance and the Interquartile range
Measures of dispersion
Range
Variance and Standard deviation
Coefficient of Variation
Quantiles (Quartiles, Deciles and percentiles)
Measures of dispersion : The Range
Difference between the maximum and the minimum of the data set :
Range = Highest value – Lowest value = X(n) – X(1)
The larger the range, the more dispersion in a set of data
Affected by the extreme values
Example 1 :
1 2 3 2 3 6 3 4 4 5 4 4 3
Range = 6 – 1 = 5
Example 2 :
Classes    ni
[10-20[    10
[20-30[    25
[30-40[    15
[40-50[    15
[50-60[    15
Total      80
Range = 60 – 10 = 50
Measures of dispersion : The Variance
Case 1- Non Grouped data
Case 2- Grouped data
Case 3- Grouped data into classes
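The three cases can be handled by one function, since ungrouped data are a frequency table in which every frequency equals 1. The Python sketch below assumes the descriptive (divisor n) definition, $V(X) = \frac{1}{n}\sum n_i (x_i - \bar{x})^2$, with class midpoints ci in place of xi for data grouped into classes; if the course divides by n - 1 instead, the divisor has to be changed accordingly. The function name `variance` and the reuse of earlier data sets are illustrative choices.

```python
import math

def variance(values, counts=None):
    """Variance with divisor n (assumed convention); counts default to 1 each."""
    if counts is None:
        counts = [1] * len(values)
    n = sum(counts)
    mean = sum(x * c for x, c in zip(values, counts)) / n
    return sum(c * (x - mean) ** 2 for x, c in zip(values, counts)) / n

# Case 1: non grouped data.
v1 = variance([2, 4, 5, 1, 3])
# Case 2: grouped data (values Xi with frequencies ni).
v2 = variance([2, 3, 4, 5], [1, 5, 2, 2])
# Case 3: grouped data into classes (midpoints ci with frequencies ni).
v3 = variance([4, 8, 12, 16, 20], [60, 80, 130, 100, 30])

print(v1, v2, v3, math.sqrt(v3))   # the last value is the standard deviation
```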
Measures of dispersion : The Standard deviation
The standard deviation is the square root of the variance: $\sigma(X) = \sqrt{V(X)}$
Properties of variance and standard deviation
The higher the variance (standard deviation), the more spread out the data:
the values become more dispersed around the mean
For approximately bell-shaped distributions, about 95% of the observations belong to $[\bar{x} - 2\sigma ; \bar{x} + 2\sigma]$
If the variance is equal to zero, then all the values are equal
The variance (resp. standard deviation) is affected by the extreme values
The variance obtained after grouping data into classes generally differs from the
variance of the data before grouping
Measures of dispersion : Coefficient of variation
Is a measurement without unit and defined by $CV = \frac{\sigma(X)}{\bar{x}}$
Expressed as a percentage: $CV \times 100\%$
The higher the coefficient of variation, the greater the level of
dispersion
It allows for comparison between distributions of values whose scales
of measurement are not comparable
Example :
10 individuals
Age expressed in years
Salary expressed in dollars
Conclusion: Age is more dispersed than the salary
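The comparison in the example can be illustrated numerically. The values of the 10 individuals are not given here, so the ages and salaries below are hypothetical numbers invented purely for illustration; the sketch also assumes the population standard deviation (divisor n), computed with `statistics.pstdev`.

```python
import statistics

def coefficient_of_variation(values):
    """CV = standard deviation / mean, returned as a percentage."""
    return 100 * statistics.pstdev(values) / statistics.mean(values)

# Hypothetical data for 10 individuals (not taken from the course).
ages = [22, 25, 31, 38, 44, 47, 52, 58, 61, 65]                         # years
salaries = [1900, 2000, 2050, 2100, 2150, 2200, 2250, 2300, 2400, 2500]  # dollars

cv_age = coefficient_of_variation(ages)
cv_salary = coefficient_of_variation(salaries)
print(round(cv_age, 1), round(cv_salary, 1))
```

Although the salaries have by far the larger standard deviation in absolute terms, their CV is smaller, so in this made-up data set age is relatively more dispersed than salary.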
Measures of dispersion : Quantiles
Values that divide a ranked data set into several equal parts
Quartiles : divide a distribution into four equal parts : Q1, Q2, Q3
Deciles : divide a distribution into 10 equal parts
Percentiles : divide a distribution into 100 equal parts
Measures of dispersion : Quartiles
Quartiles: Divide the distribution into 4 equal parts. There are 3
quartiles: Q1 (25% accumulated) , Q2 = Me (50% accumulated), Q3 (75%
accumulated).
Case 1- Non Grouped data
Order the observations and calculate nα (α = ¼; ½; ¾) :
If nα is not an integer, the quartile is x(E(nα) + 1), where E(nα) is the integer part of nα
If nα is an integer, the quartile is (x(nα) + x(nα + 1))/2
Example 1 : n is odd (n = 13)
1 2 3 2 3 6 3 4 4 5 4 4 3
Step 1 (ordering): 1 2 2 3 3 3 3 4 4 4 4 5 6
Step 2:
Q1 : n/4 = 13/4 = 3.25 => The integer part of 3.25 is 3 => Q1 = x(3+1) = x(4) = 3 (the 4th
value of ordered data)
Q2 : 2n/4 = 13/2 = 6.5 => The integer part of 6.5 is 6 => Q2 = Me = x(6+1) = x(7) = 3 (the
7th value of ordered data)
Q3 : 3n/4 = 3(13)/4 = 9.75 => The integer part of 9.75 is 9 => Q3 = x(9+1) = x(10) = 4
Example 2 : n is even (n = 8)
5 6 8 10 9 9 4 7
Step 1 (ordering): 4 5 6 7 8 9 9 10
Step 2:
Q1 : n/4 = 8/4 = 2 => Q1 = (x(2) + x(3))/2 = (5 + 6)/2 = 5.5
Q2 : n/2 = 4 => Q2 = Me = (x(4) + x(5))/2 = (7 + 8)/2 = 7.5
Q3 : 3n/4 = 6 => Q3 = (x(6) + x(7))/2 = (9 + 9)/2 = 9
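The rank rule used in these two examples (take x(E(nα) + 1) when nα is not an integer, and the average of x(nα) and x(nα + 1) when it is) can be sketched in Python as follows. The function name `quartile` is hypothetical, and the rule is inferred from the worked examples; statistical software often uses other interpolation conventions and may return slightly different values.

```python
import math

def quartile(values, alpha):
    """Quantile of order alpha (1/4, 1/2 or 3/4) using the rank rule of the examples."""
    xs = sorted(values)
    pos = len(xs) * alpha            # n * alpha
    k = math.floor(pos)              # integer part E(n * alpha)
    if pos == k:
        # n * alpha is an integer: average x(k) and x(k + 1) (1-based ranks).
        return (xs[k - 1] + xs[k]) / 2
    # Otherwise take x(k + 1), which is xs[k] in 0-based indexing.
    return xs[k]

data1 = [1, 2, 3, 2, 3, 6, 3, 4, 4, 5, 4, 4, 3]   # Example 1, n = 13
data2 = [5, 6, 8, 10, 9, 9, 4, 7]                  # Example 2, n = 8
for data in (data1, data2):
    print([quartile(data, a) for a in (0.25, 0.5, 0.75)])
```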
Measures of dispersion : Quartiles
Case 2- Grouped data
Step 1 : Calculate the increasing cumulative frequency : Ni
Step 2 : Calculate nα (α = ¼; ½; ¾)
If nα equals one of the increasing cumulative frequencies, the quartile is the
average of the value of X associated with it and the next value of X
Otherwise, we take the value of X associated with the increasing
cumulative frequency immediately greater than nα
Example 3 : Data from the table of the number of persons (n = 50)
Step 1: Calculate Ni
Step 2:
Q1: n/4 = 50/4 = 12.5 => Q1 = (x(12) + x(13))/2 = (2 + 2)/2 = 2
Q2: n/2 = 50/2 = 25 => Q2 = (x(25) + x(26))/2 = (3 + 3)/2 = 3
Q3: 3n/4 = 150/4 = 37.5 => Q3 = (x(37) + x(38))/2 = (4 + 4)/2 = 4
Measures of dispersion : The Quartiles
Quartiles graphically – Example 3
[Figure: cumulative diagram of Ni (vertical axis from 10 to 50); the levels n/4 = 12.5, n/2 = 50/2 = 25 and 3n/4 = 37.5 give Q1, Q2 and Q3]
Measures of dispersion : The Quartiles
Case 3- Grouped data by classes – Continuous Variable
Step 1 : Calculate the increasing cumulative frequency (Ni) and find the quartile class: calculate αn
(α = ¼; ½; ¾) and take the class whose increasing cumulative frequency is equal to αn or,
otherwise, immediately greater than αn
Step 2 : Interpolate inside the quartile class [L, L + a[ : $Q_\alpha = L + \frac{\alpha n - N_{q-1}}{n_q} \times a$, where $n_q$ is the frequency of the quartile class and $N_{q-1}$ the increasing cumulative frequency of the class before it
Example 5 :
Classes    ni    Ni (increasing)   Ni (decreasing)
[10-20[    10    10                80
[20-30[    25    35                70
[30-40[    15    50                45
[40-50[    15    65                30
[50-60[    15    80                15
Total      80
Step 1: Find the 1st Quartile class
n/4 = 80/4 = 20 => Q1 ∊ [20-30[
Step 2 : Q1 = 20 + [(20-10)/25] × 10 = 24
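The calculation Q1 = 20 + [(20-10)/25] × 10 = 24 is a linear interpolation inside the quartile class, and the same interpolation gives the median and Q3. A minimal Python sketch, assuming that interpolation; the function name `quantile_by_classes` and the representation of the classes by their bounds are choices made here.

```python
def quantile_by_classes(bounds, counts, alpha):
    """Interpolated quantile of order alpha for classes [bounds[i], bounds[i+1][."""
    target = alpha * sum(counts)     # alpha * n
    cum_before = 0                   # cumulative frequency before the current class
    for lo, hi, c in zip(bounds, bounds[1:], counts):
        if c > 0 and cum_before + c >= target:
            # Linear interpolation inside the class that reaches alpha * n.
            return lo + (target - cum_before) / c * (hi - lo)
        cum_before += c
    raise ValueError("alpha must be between 0 and 1")

bounds = [10, 20, 30, 40, 50, 60]
counts = [10, 25, 15, 15, 15]
q1 = quantile_by_classes(bounds, counts, 0.25)
me = quantile_by_classes(bounds, counts, 0.50)
q3 = quantile_by_classes(bounds, counts, 0.75)
print(q1, me, q3, q3 - q1)
```

For this table the sketch gives Q1 = 24, a median of about 33.33 and Q3 of about 46.67, so the interquartile range is about 22.67.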
Measures of dispersion : The Quartiles
Quartiles graphically – Grouped data by classes
[Figure: cumulative frequency (Ni) curves; the level n/4 = 20 gives Q1]
Measures of dispersion : Interquartile Range
Interquartile range : IQR = Q3 – Q1
Interquartile interval : [Q1; Q3]
It measures the dispersion around the median: the interquartile interval contains the middle 50% of the observations
Grouped data, Boxplot, Skewness,
Kurtosis, Transformation of variables,
Means and Variances of two Groups
Grouped data into classes
Given a set of raw or ungrouped data, how would you group that data
into suitable classes that are easy to work with and at the same time
meaningful?
The first step is to determine how many classes you want to have
The second step is to determine the class interval (width)
Grouped data: Example
We observed the salaries of 12 employees and obtained :
315; 423; 300; 548; 374; 452; 463; 339; 324; 432; 315; 450
Population: Employees
Sample: 12 Employees
Individual: 1 employee
Variable: Salary
Type: Quantitative continuous
Number of classes: k ≥ ln(12)/ln(2) => k ≥ 3.6 => k = 4
Width: a ≥ (548 – 300)/4 = 62
Salary       ni
[300-362[    5
[362-424[    2
[424-486[    4
[486-548]    1
Total        12
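The grouping of the 12 salaries can be reproduced step by step. The Python sketch below follows the two rules shown in the example, k ≥ ln(n)/ln(2) and a ≥ (max - min)/k, rounding each up to the next integer; the variable names, and closing the last class on the right so that the maximum is included, mirror the table but are otherwise illustrative choices.

```python
import math

salaries = [315, 423, 300, 548, 374, 452, 463, 339, 324, 432, 315, 450]

n = len(salaries)
low, high = min(salaries), max(salaries)
k = math.ceil(math.log(n) / math.log(2))     # number of classes: 3.58 rounded up
width = math.ceil((high - low) / k)          # class width
edges = [low + i * width for i in range(k + 1)]

counts = [0] * k
for s in salaries:
    # Classes are [edge_i, edge_i+1[; the last class is closed on the right.
    i = min((s - low) // width, k - 1)
    counts[i] += 1

for lo, hi, c in zip(edges, edges[1:], counts):
    print(f"[{lo}-{hi}[", c)
```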
Boxplot
A Boxplot is a standardized way of displaying the distribution of data
based on a five number summary (“minimum”, first quartile (Q1), median,
third quartile (Q3), and “maximum”).
It can tell you about your outliers and what their values are
It can also tell you if your data is symmetrical, how tightly your data is
grouped, and if and how your data is skewed.
Outliers or extreme values: observations below L or above U, where
Minimum: L = Q1 - 1.5 × IQR
Maximum: U = Q3 + 1.5 × IQR
Boxplot: Example
Suppose that the dataset consists of these hypothetical test scores:
5 39 75 79 85 90 91 93 93 98
Q1 = x(3) = 75, Q2 = (x(5) + x(6))/2 = 87.5, Q3 = x(8) = 93. IQR = 93 - 75 = 18.
Outliers or extreme values:
Minimum: L = Q1 - 1.5 × IQR = 75 - 1.5 × 18 = 48
Maximum: U = Q3 + 1.5 × IQR = 93 + 1.5 × 18 = 120
2 outliers (left side): 5 and 39, both below L = 48
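The fences and the outliers of the test-score example can be checked with a short script. This Python sketch assumes that the quartiles are obtained with the rank rule used for ungrouped data earlier in the chapter (x(E(nα) + 1) when nα is not an integer), which gives Q1 = x(3) = 75 and Q3 = x(8) = 93; other quartile conventions shift the fences slightly but flag the same two scores here.

```python
import math

def quartile(sorted_xs, alpha):
    """Rank rule assumed from the ungrouped-data examples of this chapter."""
    pos = len(sorted_xs) * alpha
    k = math.floor(pos)
    if pos == k:
        return (sorted_xs[k - 1] + sorted_xs[k]) / 2
    return sorted_xs[k]

scores = sorted([5, 39, 75, 79, 85, 90, 91, 93, 93, 98])
q1, q3 = quartile(scores, 0.25), quartile(scores, 0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr      # L: values below this fence are outliers
upper = q3 + 1.5 * iqr      # U: values above this fence are outliers
outliers = [x for x in scores if x < lower or x > upper]

print(q1, q3, iqr, lower, upper, outliers)
```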
Skewness
The asymmetry (skewness) of a distribution of a variable X is
measured using:
the Fisher asymmetry coefficient
Pearson coefficient
Yule Coefficient
Kurtosis
The kurtosis of a distribution indicates how flat or peaked the
distribution is.
The kurtosis is measured using the Fisher coefficient.
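As an illustration, the sketch below computes shape coefficients under commonly used definitions, which are assumptions here rather than the course's stated formulas: the Fisher skewness $\gamma_1 = \mu_3/\sigma^3$, the Fisher kurtosis $\gamma_2 = \mu_4/\sigma^4 - 3$ (both with central moments divided by n), and the Yule coefficient $(Q_3 + Q_1 - 2Q_2)/(Q_3 - Q_1)$. The Pearson coefficient is left out because several versions are in use. All function names are hypothetical.

```python
def central_moment(values, order):
    """Central moment of the given order, with divisor n."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** order for x in values) / len(values)

def fisher_skewness(values):
    # mu_3 / sigma^3: negative for a long left tail, 0 for symmetric data.
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def fisher_kurtosis(values):
    # mu_4 / sigma^4 - 3: 0 for a normal distribution, negative if flatter.
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

def yule_coefficient(q1, q2, q3):
    # Compares the distances of Q1 and Q3 from the median.
    return (q3 + q1 - 2 * q2) / (q3 - q1)

scores = [5, 39, 75, 79, 85, 90, 91, 93, 93, 98]
print(fisher_skewness(scores), fisher_kurtosis(scores))
print(yule_coefficient(75, 87.5, 93))
```

For the test scores both the Fisher and the Yule coefficients come out negative, which matches the long left tail created by the two low scores.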
Means and Variances of two Groups
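We assume that the n observations are divided into two groups:
Group GA and Group GB
               Group A    Group B
Sample Size    nA         nB
Mean           x̄A         x̄B
Variance       VA(X)      VB(X)
The total or aggregated mean is a weighted average of the two group means:
$\bar{x} = \frac{n_A \bar{x}_A + n_B \bar{x}_B}{n_A + n_B}$
The aggregated variance is the weighted average of the group variances plus the weighted
variance of the group means around $\bar{x}$ (variances computed with divisor n):
$V(X) = \frac{n_A V_A(X) + n_B V_B(X)}{n} + \frac{n_A (\bar{x}_A - \bar{x})^2 + n_B (\bar{x}_B - \bar{x})^2}{n}$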
Means and Variances of two Groups: Example
Students in a class are divided into two groups, A and B. The mean
age of the 12 students in group A is 23 years, while the mean age of
the 15 students in group B is 25 years. Compute the mean age of all
students in this class.
               Group A    Group B
Sample Size    12         15
Mean           23         25
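The exercise can be worked out with a weighted average of the two group means. The Python sketch below also includes the usual within-group plus between-group decomposition of the variance, which assumes variances computed with divisor n; the function names and the small two-group check data are illustrative and not part of the exercise.

```python
def aggregated_mean(n_a, mean_a, n_b, mean_b):
    """Weighted average of two group means."""
    return (n_a * mean_a + n_b * mean_b) / (n_a + n_b)

def aggregated_variance(n_a, mean_a, var_a, n_b, mean_b, var_b):
    """Within-group part plus between-group part (variances with divisor n)."""
    n = n_a + n_b
    mean = aggregated_mean(n_a, mean_a, n_b, mean_b)
    within = (n_a * var_a + n_b * var_b) / n
    between = (n_a * (mean_a - mean) ** 2 + n_b * (mean_b - mean) ** 2) / n
    return within + between

# The class example: 12 students with mean age 23 and 15 with mean age 25.
print(aggregated_mean(12, 23, 15, 25))

# Illustrative check: groups [1, 2, 3] and [4, 6] pooled into [1, 2, 3, 4, 6].
print(aggregated_variance(3, 2, 2 / 3, 2, 5, 1))
```

For the class, the aggregated mean is 651/27, about 24.1 years, closer to 25 than to 23 because group B is larger.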
Transformation of variables
A change of variable is a transformation of a variable X into a
new variable Y by multiplying X by a constant and/or adding a constant to
the variable X.
yi = axi + b
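A standard consequence of yi = axi + b is that the mean transforms the same way, $\bar{y} = a\bar{x} + b$, while the variance is only affected by the scale factor, $V(Y) = a^2 V(X)$. The short Python check below uses arbitrary illustrative constants a = 2 and b = 1 and the population variance (divisor n) from the `statistics` module.

```python
import statistics

x = [2, 4, 5, 1, 3]
a, b = 2, 1                          # arbitrary illustrative constants
y = [a * xi + b for xi in x]         # yi = a * xi + b

mean_x, var_x = statistics.mean(x), statistics.pvariance(x)
mean_y, var_y = statistics.mean(y), statistics.pvariance(y)

print(mean_y, a * mean_x + b)        # both equal the mean of Y
print(var_y, a ** 2 * var_x)         # both equal the variance of Y
```

Adding b shifts every value by the same amount and therefore leaves the dispersion unchanged; only the multiplication by a changes it.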