
Comprehensive Guide to Statistics Concepts

The document provides an overview of basic statistical concepts, including definitions, types of data, and methods of data representation. It covers descriptive and inferential statistics, emphasizing the importance of sampling and the distinction between qualitative and quantitative data. Additionally, it discusses various measures of central tendency, such as mean and mode, and their applications in analyzing data.
Copyright
© All Rights Reserved

Statistics (S1101)

By
Abbas MOURAD
S1100
❑ Chapter 1: Basic Statistical Concepts
➢ Some definitions

❑ Chapter 2: Univariate Statistical Analysis
➢ Graphical representation of data: bar chart, strip chart, histogram, boxplot
➢ Numerical representation of data: mean, median, mode, variance, quantiles, skewness and kurtosis

❑ Chapter 3: Two-dimensional Descriptive Statistics
➢ Data representation, coefficient of correlation and simple linear regression

❑ Chapter 4: Combinatorial Analysis & Probability
➢ Permutations, arrangements and combinations
➢ Basic concepts of probability, conditional probability and Bayes' theorem
Chapter 1:
Basic Statistical Concepts
What is Statistics?
1. Collecting data (e.g., survey)
2. Presenting data (e.g., charts and tables)
3. Characterizing data (e.g., mean, median)
Why? Data analysis and decision-making.
What is Statistics?
❑ Statistics: the science of collecting, describing, and interpreting data.

Two areas of statistical methods:
➢ Descriptive statistics: collection, presentation, and description of sample data.
➢ Inferential statistics: making decisions and drawing conclusions about populations.
Descriptive Statistics

1. Involves
• Collecting data
• Presenting data
• Characterizing data
2. Purpose
• Describe data

[Figure: bar chart of values (in $) for quarters Q1 to Q4, summarized by Mean = 30.5 and S² = 113]
Inferential Statistics
1. Involves
• Estimation
• Hypothesis testing
2. Purpose
• Make decisions about population characteristics
Population & Sample
[Figure: a population of size N from which several samples (Sample 1 to Sample 4), each of size n, are drawn]

❑ Statistics can be used to gain information about a population by examining a sample of that population. Generalizations from a sample to a population are valid only if the sample is representative of that population.
Simple Random Sample
❑ Every subset of a specified size n from the population has an equal chance of being selected.
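The definition above can be illustrated with a minimal Python sketch. It assumes the population is available as a list; the employee numbers 1 to 100 and the helper name `simple_random_sample` are hypothetical, introduced only for illustration. `random.sample` draws without replacement, so every subset of size n is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals; every subset of size n is equally likely."""
    rng = random.Random(seed)          # a seed makes the draw reproducible
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical population: employee numbers 1..100
employees = list(range(1, 101))
sample = simple_random_sample(employees, 10, seed=0)
print(sample)
```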
Types of Data

Data are either quantitative or qualitative.

Quantitative data are measurements that are recorded on a naturally occurring numerical scale.

Qualitative data are measurements that cannot be measured on a natural numerical scale; they can only be classified into one of a group of categories.
Type of variables
❑ Example: The table below represents employees of a communications company.

Employee no.   Age   Gender   Number of children   Level of satisfaction
1              25    M        0                    Satisfied
2              35    M        2                    Very Satisfied
3              24    F        0                    Satisfied
4              52    M        3                    Unsatisfied
….             ….    ….       ….                   ….
99             48    F        2                    Very Satisfied
100            26    F        1                    Satisfied

▪ Population: Employees of a communications company
▪ Sample: 100 employees of a communications company
▪ Individual: 1 employee
▪ Variables: Age, Gender, Number of children and Level of satisfaction
Descriptive Statistics
❑ Two kinds of variables:
➢ Qualitative, or categorical, variable: a variable that categorizes or describes an element of a population.
   • Nominal (e.g., Gender with 2 levels: M & F)
   • Ordinal (e.g., Level of satisfaction with 5 levels)
➢ Quantitative, or numerical, variable: a variable that quantifies an element of a population.
   • Discrete (e.g., Number of children)
   • Continuous (e.g., Age)
Descriptive Statistics
❑ Nominal variable: a qualitative variable that categorizes (or describes, or names) an element of a population.
❑ Ordinal variable: a qualitative variable that incorporates an ordered position, or ranking.
❑ Discrete variable: a quantitative variable that can assume a countable number of values. Intuitively, a discrete variable can assume values corresponding to isolated points along a line interval. That is, there is a gap between any two values.
❑ Continuous variable: a quantitative variable that can assume an uncountable number of values. Intuitively, a continuous variable can assume any value along a line interval, including every possible value between any two values.
Chapter 2:
Univariate Statistical Analysis
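Basic Notations
For a distribution with k values or classes: $n = \sum_{i=1}^{k} n_i$, $\sum_{i=1}^{k} f_i = 1$ and $\sum_{i=1}^{k} p_i = 100\%$, where $f_i = n_i/n$ is the relative frequency and $p_i = 100 \times f_i$ the percentage.

Xi        ni       fi            pi                Ni (increasing)   Ni (decreasing)
[0-10[    2        2/10 = 0.2    0.2×100 = 20%     2                 8+2 = 10
[10-20[   4        0.4           40%               2+4 = 6           4+4 = 8
[20-30[   3        0.3           30%               6+3 = 9           1+3 = 4
[30-40[   1        0.1           10%               9+1 = 10          1
Total     n = 10   1             100%              -------           -------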
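The columns of the notation table can be recomputed with a short Python sketch. The variable names are illustrative choices, not notation from the slides; the counts are the ni column of the table.

```python
from itertools import accumulate

classes = ["[0-10[", "[10-20[", "[20-30[", "[30-40["]
counts = [2, 4, 3, 1]                              # n_i

n = sum(counts)                                    # total size n
rel_freq = [c / n for c in counts]                 # f_i = n_i / n
percent = [100 * f for f in rel_freq]              # p_i = 100 * f_i
cum_up = list(accumulate(counts))                  # increasing cumulative N_i
cum_down = list(accumulate(counts[::-1]))[::-1]    # decreasing cumulative N_i

for row in zip(classes, counts, rel_freq, percent, cum_up, cum_down):
    print(*row)
```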
Descriptive Statistics
❑ Descriptive statistics are used by researchers to report on populations and samples.

❑ Types of descriptive statistics:
✓ Organize data (qualitative variable)
▪ Tables
▪ Graphs
✓ Summarize data (quantitative variable)
▪ Central tendency
▪ Variation
Representation of qualitative variable
▪ Measurement scale: Nominal
Example: Gender (❑ Male ❑ Female)
✓ Tables: frequency distributions (percentage)
✓ Graphs: bar chart or pie chart
✓ Measures of central tendency: mode

▪ Measurement scale: Ordinal
Example: What is your level of satisfaction with public transport in Lebanon? (❑ Very satisfied ❑ Satisfied ❑ Neither satisfied nor unsatisfied ❑ Unsatisfied ❑ Very unsatisfied)
✓ Tables: frequency distributions (percentage) and cumulative frequency
✓ Graphs: bar chart or pie chart
✓ Measures of central tendency: mode and median
Descriptive Statistics : Qualitative variable
Example: Variable Gender: bar chart and/or pie chart

Gender   ni        fi      pi      Angle = fi × 360°
Male     33        0.236   23.6%   84.96°
Female   107       0.764   76.4%   275.04°
Total    n = 140   1       100%    360°

[Figures: bar chart and pie chart of Gender]
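The pie chart angles follow directly from the counts, as in this small Python sketch (variable names are illustrative). Note that the table rounds fi to three decimals before multiplying by 360°, so its angles (84.96° and 275.04°) differ slightly from the unrounded ones computed here.

```python
counts = {"Male": 33, "Female": 107}   # n_i from the Gender table
n = sum(counts.values())

# Angle of each sector: f_i * 360 degrees, with f_i = n_i / n
angles = {label: ni / n * 360 for label, ni in counts.items()}

for label, angle in angles.items():
    print(label, round(angle, 2))
```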


Descriptive Statistics: Quantitative Discrete variable
✓ Quantitative discrete variable: numerical summaries: mode, median, mean, variance, …
✓ Graphs: strip chart

Number of persons   ni   fi     Ni
1                   5    0.1    5
2                   9    0.18   14
3                   15   0.3    29
4                   10   0.2    39
5                   6    0.12   45
6                   3    0.06   48
7                   0    0      48
8                   2    0.04   50
Total               50   1      …
Descriptive Statistics: Quantitative Discrete variable
✓ Quantitative discrete variable: cumulative diagram or cumulative distribution function

[Figure: cumulative diagram built from the Ni column of the "Number of persons" table (Ni = 5, 14, 29, 39, 45, 48, 48, 50)]
Quantitative continuous Variable: Histogram & Ogive
Descriptive Statistics: Quantitative Cont. variable
✓ Quantitative continuous variable: numerical summaries: mode, median, mean, variance, …
✓ Graphs: histogram
▪ Classes of the same width: the histogram is based on the frequencies ni
▪ Classes of different widths: the histogram is based on the densities di
Descriptive Statistics: Quantitative Cont. variable
Quantitative continuous variable: histogram (same width)

Classes    ni    Width: ai
[2-6[      60    6-2 = 4
[6-10[     80    10-6 = 4
[10-14[    130   14-10 = 4
[14-18[    100   18-14 = 4
[18-22[    30    22-18 = 4
Total      400   …..

[Figure: histogram with its frequency polygon]
Descriptive Statistics: Quantitative Cont. variable
Quantitative continuous variable: histogram (different widths)

Classes    ni    Width: ai    Density: di = ni/ai
[2-3[      60    3-2 = 1      60/1 = 60
[3-4[      80    4-3 = 1      80/1 = 80
[4-8[      130   8-4 = 4      130/4 = 32.5
[8-16[     100   16-8 = 8     100/8 = 12.5
[16-22[    30    22-16 = 6    30/6 = 5
Total      400   …..          …..

With different widths, the frequency polygon is based on the greatest common divisor (gcd) of the widths ai. In our example: gcd(ai) = 1.
Descriptive Statistics: Quantitative Cont. variable
Quantitative Continuous Variable : Cumulative Frequencies Curve (Ogive)

Classes ni Ni
[2-3[ 60 60
[3-4[ 80 140
[4-8[ 130 270
[8-16[ 100 370
[16-22[ 30 400
Total 400 …..

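The Numerical Characteristics
Measures of Central Tendency: The Mean
❑ Non-grouped data: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Example: 2, 4, 5, 1, 3 => $\bar{x} = (2+4+5+1+3)/5 = 3$

❑ Grouped data: the distribution is given as a frequency table, and $\bar{x} = \frac{1}{n}\sum_{i=1}^{k} n_i x_i$

Example:

Xi      ni
2       1
3       5
4       2
5       2
Total   10

=> $\bar{x} = (2 \times 1 + 3 \times 5 + 4 \times 2 + 5 \times 2)/10 = 35/10 = 3.5$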
Measures of Central Tendency : The Mean
❑ Grouped data by classes: the data are grouped into classes, ci is the class midpoint, and the mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{k} n_i c_i$

Classes    ni    Center of class: ci
[2-6[      60    (2+6)/2 = 4
[6-10[     80    (6+10)/2 = 8
[10-14[    130   (10+14)/2 = 12
[14-18[    100   (14+18)/2 = 16
[18-22[    30    (18+22)/2 = 20
Total      400   …..
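As a worked check of the class-midpoint mean, the following Python sketch applies the weighted average of midpoints to the table of classes [2-6[ to [18-22[. The variable names are illustrative.

```python
classes = [(2, 6), (6, 10), (10, 14), (14, 18), (18, 22)]
counts = [60, 80, 130, 100, 30]                       # n_i

midpoints = [(lo + hi) / 2 for lo, hi in classes]     # c_i
n = sum(counts)

# Mean of grouped data: (1/n) * sum of n_i * c_i
mean = sum(ni * ci for ni, ci in zip(counts, midpoints)) / n
print(mean)  # 11.6
```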
The Numerical Characteristics
Measures of Central Tendency: The Mode
❑ The mode is the most repeated value in a distribution.

❑ Non-grouped data
Example 1: 2, 2, 4, 5, 8, 12, 2, 1, 2 => Mode = 2
Example 2: 2, 1, 4, 5, 8, 12, 2, 1, 1 => Mode = 1 (the value 1 appears three times, the value 2 only twice)
Example 3: 2, 4, 6, 8, 1, 10, 22, 45 => Mode doesn't exist (no value is repeated)

❑ Grouped data: the mode is the value with the highest frequency.

Xi      ni
2       1
3       5
4       2
5       2
Total   10

=> Mode = 3
Measures of Central Tendency : The Mode
❑ The mode is the most repeated value in a distribution.
❑ Grouped data: graphically

Xi      ni
2       1
3       5
4       2
5       2
Total   10

[Figure: strip chart of ni against Xi; the tallest bar is at Xi = 3]

Mode = 3
Measures of Central Tendency : The Mode
❑ Grouped data by classes: the mode depends on the class widths (same or different).
➢ Same width: the modal class is the class with the highest frequency ni.
➢ Different widths: the modal class is the class with the highest density di = ni/ai.

Example (same width)

Classes    ni    Width (ai)
[2-6[      60    4
[6-10[     80    4
[10-14[    130   4
[14-18[    100   4
[18-22[    30    4
Total      400

Step 1: Modal class: the class with the highest frequency => Modal class = [10-14[
Step 2: Locate the mode inside the modal class.
Measures of Central Tendency : The Mode
Example (same width): graphically

[Figure: histogram of the classes [2-6[ to [18-22[, with the mode located inside the modal class [10-14[]

Mode ≈ 12.5
Measures of Central Tendency : The Mode
Example (different widths)

Classes    ni    Width: ai    Density: di = ni/ai
[2-3[      60    3-2 = 1      60/1 = 60
[3-4[      80    4-3 = 1      80/1 = 80
[4-8[      130   8-4 = 4      130/4 = 32.5
[8-16[     100   16-8 = 8     100/8 = 12.5
[16-22[    30    22-16 = 6    30/6 = 5
Total      400   …..          …..

Step 1: Modal class: the class with the highest density => Modal class = [3-4[
Step 2: Locate the mode inside the modal class (graphically).

Mode ≈ 3.3
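The slides give the results of Step 2 (12.5 and about 3.3) but the formula itself did not survive. A minimal sketch, assuming the usual interpolation Mo = L + a × Δ1/(Δ1 + Δ2), where L and a are the lower bound and width of the modal class and Δ1, Δ2 are the density differences with the previous and next classes: this assumed formula reproduces both values. The function name `grouped_mode` is hypothetical.

```python
def grouped_mode(classes, counts):
    """Interpolated mode for data grouped into classes (assumed formula)."""
    widths = [hi - lo for lo, hi in classes]
    dens = [ni / a for ni, a in zip(counts, widths)]    # d_i = n_i / a_i
    m = dens.index(max(dens))                           # index of the modal class
    d_prev = dens[m - 1] if m > 0 else 0.0
    d_next = dens[m + 1] if m < len(dens) - 1 else 0.0
    delta1 = dens[m] - d_prev                           # gap with previous class
    delta2 = dens[m] - d_next                           # gap with next class
    lower = classes[m][0]
    return lower + widths[m] * delta1 / (delta1 + delta2)

counts = [60, 80, 130, 100, 30]
same_width = [(2, 6), (6, 10), (10, 14), (14, 18), (18, 22)]
diff_width = [(2, 3), (3, 4), (4, 8), (8, 16), (16, 22)]

print(grouped_mode(same_width, counts))             # 12.5
print(round(grouped_mode(diff_width, counts), 2))   # 3.3
```

Working with densities covers both cases: when all widths are equal, the highest density and the highest frequency designate the same class.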
The Numerical Characteristics
Measures of Central Tendency: The Median
The median is the middle value of the distribution once the values have been arranged in increasing order.

Case 1 - Ungrouped data
Step 1: Order the observations from smallest to largest: x(1) ≤ x(2) ≤ … ≤ x(n)
Step 2: The median is
▪ n odd: Me = x(E(n/2) + 1), where E(n/2) is the integer part of n/2
▪ n even: Me = (x(n/2) + x(n/2 + 1))/2

Example 1: n is odd (n = 13)
Data: 1 2 3 2 3 6 3 4 4 5 4 4 3
Step 1 (ordering): 1 2 2 3 3 3 3 4 4 4 4 5 6
Step 2: n is odd => the integer part of 13/2 = 6.5 is 6 => Me = x(6+1) = x(7) = 3 (the 7th value of the ordered data)

Example 2: n is even (n = 8)
Data: 5 6 8 10 9 9 4 7
Step 1 (ordering): 4 5 6 7 8 9 9 10
Step 2: n is even => Me = (x(4) + x(5))/2 = (7 + 8)/2 = 7.5 (the median lies between the 4th and 5th observations of the ordered data)
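The two-step rule for ungrouped data can be sketched in Python as follows, using the data of Examples 1 and 2. The function name `median` is an illustrative choice; Python lists are 0-based, so x(k) corresponds to `x[k - 1]`.

```python
def median(values):
    """Median of ungrouped data: order the values, then take the middle."""
    x = sorted(values)                 # Step 1: ordering
    n = len(x)
    if n % 2 == 1:                     # n odd: x(E(n/2) + 1)
        return x[n // 2]
    return (x[n // 2 - 1] + x[n // 2]) / 2   # n even: mean of x(n/2), x(n/2 + 1)

example1 = [1, 2, 3, 2, 3, 6, 3, 4, 4, 5, 4, 4, 3]   # n = 13
example2 = [5, 6, 8, 10, 9, 9, 4, 7]                 # n = 8

print(median(example1))  # 3
print(median(example2))  # 7.5
```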
Measures of Central Tendency : The Median
Case 2 - Grouped data
Step 1: Calculate the increasing cumulative frequencies Ni
Step 2: Calculate n/2:
▪ If n/2 is not one of the cumulative frequencies, the median is the value of X whose cumulative frequency is the first one greater than n/2.
▪ If n/2 equals one of the cumulative frequencies Ni, the observations x(n/2) and x(n/2 + 1) are two different values of X, and the median is their average.

Example 3: Data from the "Number of persons" table (n = 50)
Step 1: Calculate Ni
Step 2: n/2 = 50/2 = 25 => Me = (x(25) + x(26))/2 = (3 + 3)/2 = 3
Measures of Central Tendency : The Median
❑ Median graphically: Example 3

[Figure: cumulative frequency diagram (vertical axis from 0 to 50); the horizontal line at n/2 = 50/2 = 25 meets the diagram at Median = 3]
Measures of Central Tendency : The Median
Example 4:

xi      ni   Ni
10      10   10
11      2    12
12      8    20
13      4    24
Total   24

Step 1: Calculate Ni
Step 2: n/2 = 24/2 = 12 => Me = (x(12) + x(13))/2 = (11 + 12)/2 = 11.5

[Figure: cumulative frequency diagram; the horizontal line at n/2 = 12 coincides with the step between xi = 11 and xi = 12, so the median is read at 11.5]
Measures of Central Tendency : The Median
Case 3 - Grouped data by classes (continuous variable)
Step 1: Calculate the increasing cumulative frequencies Ni and find the median class: it is the first class whose increasing cumulative frequency is greater than or equal to n/2.
Step 2: Interpolate inside the median class.

Example 5:

Classes    ni   Ni (increasing)   Ni (decreasing)
[10-20[    10   10                80
[20-30[    25   35                70
[30-40[    15   50                45
[40-50[    15   65                30
[50-60[    15   80                15
Total      80

Step 1: Find the median class: n/2 = 80/2 = 40 => Me ∊ [30-40[
Step 2: Me = 30 + [(40 - 35)/15] × 10 ≈ 33.3
Measures of Central Tendency : The Median
❑ Median graphically: grouped data by classes

[Figure: cumulative frequency curve Ni, with the median read on the horizontal axis]
Measures of Central Tendency : Properties
❑ Properties of the mean:
➢ Extreme values in the data set, called outliers, affect the mean. (When outliers are present, the mean is "pulled" in the direction of the outlier.)
➢ The mean obtained after grouping data into classes generally differs from the mean obtained from ungrouped data.

❑ Properties of the median:
➢ Unlike the mean, the median is not sensitive to extreme values (outliers).
➢ Like the mean and the mode, the median obtained after grouping data into classes generally differs from the median obtained from ungrouped data.

❑ Properties of the mode:
➢ It is not affected by extreme values.
➢ It is very sensitive to variations in class width (amplitude).
➢ It may not be unique, and it may not exist.
➢ Like the mean, the mode obtained after grouping data into classes generally differs from the mode obtained from ungrouped data.
The Numerical Characteristics
Measures of dispersion: The range, the variance and the interquartile range
Measures of dispersion
❑ Range
❑ Variance and standard deviation
❑ Coefficient of variation
❑ Quantiles (quartiles, deciles and percentiles)
Measures of dispersion : The Range
❑ Difference between the maximum and the minimum of the data set: Range = Highest value – Lowest value = X(n) – X(1)
❑ The larger the range, the more dispersed the data
❑ Affected by extreme values

Example 1: 1 2 3 2 3 6 3 4 4 5 4 4 3
Range = 6 – 1 = 5

Example 2:

Classes    ni
[10-20[    10
[20-30[    25
[30-40[    15
[40-50[    15
[50-60[    15
Total      80

Range = 60 – 10 = 50
Measures of dispersion : The Variance
❑ Case 1 - Non-grouped data
❑ Case 2 - Grouped data
❑ Case 3 - Grouped data into classes
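The formulas for the three cases are not legible in this copy. As a hedged reconstruction, assuming the descriptive convention that divides by n (some courses divide by n - 1 for sample data) and reusing the notation ni for frequencies and ci for class midpoints, the variance V(X) and the standard deviation σ would read:

```latex
% Case 1: non-grouped data
V(X) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2

% Case 2: grouped data (k distinct values x_i with frequencies n_i)
V(X) = \frac{1}{n}\sum_{i=1}^{k} n_i (x_i - \bar{x})^2

% Case 3: grouped data into classes (c_i is the midpoint of class i)
V(X) = \frac{1}{n}\sum_{i=1}^{k} n_i (c_i - \bar{x})^2

% Standard deviation in all three cases
\sigma = \sqrt{V(X)}
```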
Measures of dispersion : The Standard deviation
❑ The standard deviation is the square root of the variance; it measures the dispersion around the mean.

❑ Properties of variance and standard deviation
➢ The higher the variance (standard deviation), the more spread out the data: the values become more dispersed.
➢ About 95% of the observations belong to [x̄ - 2σ ; x̄ + 2σ] when the distribution is approximately normal.
➢ If the variance is equal to zero, then all the values are equal.
➢ The variance (resp. standard deviation) is affected by extreme values.
➢ The variance obtained after grouping data into classes generally differs from the variance of the data before grouping.
Measures of dispersion : Coefficient of variation
❑ A unitless measure defined by CV = σ / x̄
❑ Expressed as a percentage
❑ The higher the coefficient of variation, the greater the level of dispersion
❑ It allows for comparison between distributions of values whose scales of measurement are not comparable

Example: 10 individuals
➢ Age expressed in years
➢ Salary expressed in dollars
Comparing the two coefficients of variation shows that age is more dispersed than salary.
Measures of dispersion : Quartiles
❑ Quantiles: values that divide ranked data into several equal parts
➢ Quartiles: divide a distribution into 4 equal parts (Q1, Q2, Q3)
➢ Deciles: divide a distribution into 10 equal parts
➢ Percentiles: divide a distribution into 100 equal parts
Measures of dispersion : Quartiles
❑ Quartiles: divide the distribution into 4 equal parts. There are 3 quartiles: Q1 (25% accumulated), Q2 = Me (50% accumulated), Q3 (75% accumulated).

❑ Case 1 - Non-grouped data
For α = 1/4, 1/2, 3/4, calculate nα on the ordered data:
▪ If nα is not an integer, the quartile is x(E(nα) + 1), where E(nα) is the integer part of nα.
▪ If nα is an integer, the quartile is (x(nα) + x(nα + 1))/2.

Example 1: n is odd (n = 13)
Data: 1 2 3 2 3 6 3 4 4 5 4 4 3
Step 1 (ordering): 1 2 2 3 3 3 3 4 4 4 4 5 6
Step 2:
▪ Q1: n/4 = 13/4 = 3.25 => the integer part of 3.25 is 3 => Q1 = x(3+1) = x(4) = 3 (the 4th value of the ordered data)
▪ Q2: 2n/4 = 13/2 = 6.5 => the integer part of 6.5 is 6 => Q2 = Me = x(6+1) = x(7) = 3 (the 7th value of the ordered data)
▪ Q3: 3n/4 = 3(13)/4 = 9.75 => the integer part of 9.75 is 9 => Q3 = x(9+1) = x(10) = 4

Example 2: n is even (n = 8)
Data: 5 6 8 10 9 9 4 7
Step 1 (ordering): 4 5 6 7 8 9 9 10
Step 2:
▪ Q1: n/4 = 8/4 = 2 => Q1 = (x(2) + x(3))/2 = (5 + 6)/2 = 5.5
▪ Q2: n/2 = 4 => Q2 = Me = (x(4) + x(5))/2 = (7 + 8)/2 = 7.5
▪ Q3: 3n/4 = 6 => Q3 = (x(6) + x(7))/2 = (9 + 9)/2 = 9
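The quartile rule used in Examples 1 and 2 can be written as one small Python function. This is a sketch of the rule as it appears in the worked examples, not a general-purpose quantile routine (statistical libraries use other conventions and may return different values); the name `quantile` is illustrative and α is assumed to lie strictly between 0 and 1.

```python
def quantile(values, alpha):
    """Quantile of ungrouped data following the rule of the worked examples."""
    x = sorted(values)                  # Step 1: ordering
    pos = alpha * len(x)                # n * alpha
    k = int(pos)                        # integer part E(n * alpha)
    if pos == k:                        # n * alpha is an integer:
        return (x[k - 1] + x[k]) / 2    #   mean of x(k) and x(k + 1)
    return x[k]                         # otherwise x(E(n * alpha) + 1)

example1 = [1, 2, 3, 2, 3, 6, 3, 4, 4, 5, 4, 4, 3]   # n = 13
example2 = [5, 6, 8, 10, 9, 9, 4, 7]                 # n = 8

print([quantile(example1, a) for a in (0.25, 0.5, 0.75)])  # [3, 3, 4]
print([quantile(example2, a) for a in (0.25, 0.5, 0.75)])  # [5.5, 7.5, 9.0]
```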
Measures of dispersion : Quartiles
❑ Case 2 - Grouped data
Step 1: Calculate the increasing cumulative frequencies Ni
Step 2: Calculate nα (α = ¼; ½; ¾)
▪ If nα is not one of the cumulative frequencies, the quartile is the value of X whose cumulative frequency is the first one greater than nα.
▪ If nα equals one of the cumulative frequencies Ni, the observations x(nα) and x(nα + 1) are two different values of X, and the quartile is their average.

Example 3: Data from the "Number of persons" table (n = 50)
Step 1: Calculate Ni
Step 2:
▪ Q1: n/4 = 50/4 = 12.5 => Q1 = (x(12) + x(13))/2 = (2 + 2)/2 = 2
▪ Q2: n/2 = 50/2 = 25 => Q2 = (x(25) + x(26))/2 = (3 + 3)/2 = 3
▪ Q3: 3n/4 = 150/4 = 37.5 => Q3 = (x(37) + x(38))/2 = (4 + 4)/2 = 4
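Measures of dispersion : The Quartiles
❑ Quartiles graphically: Example 3

[Figure: cumulative frequency diagram (vertical axis from 0 to 50); the horizontal lines at n/4 = 12.5, n/2 = 50/2 = 25 and 3n/4 = 37.5 give Q1, Q2 and Q3 on the horizontal axis]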
Measures of dispersion : The Quartiles
Case 3 - Grouped data by classes (continuous variable)
Step 1: Calculate the increasing cumulative frequencies Ni and find the quartile class: for α = ¼, ½, ¾, it is the first class whose increasing cumulative frequency is greater than or equal to αn.
Step 2: Interpolate inside the quartile class.

Example 5:

Classes    ni   Ni (increasing)   Ni (decreasing)
[10-20[    10   10                80
[20-30[    25   35                70
[30-40[    15   50                45
[40-50[    15   65                30
[50-60[    15   80                15
Total      80

Step 1: Find the 1st quartile class: n/4 = 80/4 = 20 => Q1 ∊ [20-30[
Step 2: Q1 = 20 + [(20 - 10)/25] × 10 = 24
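The interpolation used for Q1 = 20 + [(20 - 10)/25] × 10 generalizes to any quantile of class-grouped data: lower bound of the class, plus the share of the class needed to reach αn, times the class width. The Python sketch below follows that pattern; the function name `grouped_quantile` is hypothetical.

```python
from itertools import accumulate

def grouped_quantile(classes, counts, alpha):
    """Quantile of class-grouped data by linear interpolation in its class."""
    target = alpha * sum(counts)                  # alpha * n
    cum = list(accumulate(counts))                # increasing cumulative N_i
    for i, (lo, hi) in enumerate(classes):
        if cum[i] >= target:                      # first class reaching alpha * n
            before = cum[i - 1] if i > 0 else 0   # cumulative count before it
            return lo + (target - before) / counts[i] * (hi - lo)
    raise ValueError("alpha must be between 0 and 1")

classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
counts = [10, 25, 15, 15, 15]

print(grouped_quantile(classes, counts, 0.25))           # 24.0
print(round(grouped_quantile(classes, counts, 0.5), 2))  # 33.33
```

With α = ½ the same function returns the median of the table, about 33.3, which lies in the median class [30-40[.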
Measures of dispersion : The Quartiles
❑ Quartiles graphically: grouped data by classes

[Figure: cumulative frequency curve Ni; the horizontal line at n/4 = 20 meets the curve at Q1]
Measures of dispersion : Interquartile Range
❑ Interquartile range: IQR = Q3 – Q1
❑ Interquartile interval: [Q1; Q3]
❑ Measures the dispersion around the median: the interquartile interval contains the central 50% of the observations
Grouped data, Boxplot, Skewness, Kurtosis, Transformation of variables, Means and Variances of two Groups
Grouped data into classes
❑ Given a set of raw or ungrouped data, how would you group the data into classes that are easy to work with and still meaningful?
➢ The first step is to determine how many classes you want to have
➢ The second step is to determine the class interval (width)
Grouped data: Example
❑ We observed the salaries of 12 employees and obtained:
315; 423; 300; 548; 374; 452; 463; 339; 324; 432; 315; 450
➢ Population: Employees
➢ Sample: 12 employees
➢ Individual: 1 employee
➢ Variable: Salary
➢ Type: Quantitative continuous
❑ Number of classes: k ≥ ln(12)/ln(2) => k ≥ 3.6 => k = 4
❑ Width: a ≥ (548 – 300)/4 = 62

Salary       ni
[300-362[    5
[362-424[    2
[424-486[    4
[486-548]    1
Total        12
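The two steps of the salary example (number of classes, then width) can be reproduced with a short Python sketch. Taking the smallest integers that satisfy the two inequalities is an assumption consistent with the values k = 4 and a = 62; the variable names are illustrative.

```python
import math

salaries = [315, 423, 300, 548, 374, 452, 463, 339, 324, 432, 315, 450]
n = len(salaries)

# Step 1: smallest integer k with k >= ln(n) / ln(2)
k = math.ceil(math.log(n) / math.log(2))

# Step 2: smallest integer width a with a >= (max - min) / k
low, high = min(salaries), max(salaries)
width = math.ceil((high - low) / k)

# Count the observations per class; the last class is closed on the right
counts = [0] * k
for s in salaries:
    i = min((s - low) // width, k - 1)
    counts[i] += 1

edges = [low + i * width for i in range(k + 1)]
print(k, width, edges, counts)
```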
Boxplot
❑ A boxplot is a standardized way of displaying the distribution of data based on a five-number summary ("minimum", first quartile (Q1), median, third quartile (Q3), and "maximum").
❑ It shows which observations are outliers and what their values are.
❑ It also shows whether the data are symmetrical, how tightly they are grouped, and whether and how they are skewed.

❑ Outliers or extreme values are the observations that fall outside the following limits:
➢ "Minimum": L = Q1 - 1.5 × IQR
➢ "Maximum": U = Q3 + 1.5 × IQR
Boxplot: Example
❑ Suppose that the dataset consists of these hypothetical test scores:
5 39 75 79 85 90 91 93 93 98
❑ Q1 = 75, Q2 = 87.5, Q3 = 93, so IQR = 93 - 75 = 18.
❑ Limits for outliers or extreme values:
➢ "Minimum": L = Q1 - 1.5 × IQR = 75 – 1.5 × 18 = 48
➢ "Maximum": U = Q3 + 1.5 × IQR = 93 + 1.5 × 18 = 120
❑ The scores 5 and 39 fall below L = 48: 2 outliers (left side). No score exceeds U.
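The outlier limits can be checked with a Python sketch that reuses the quartile rule for ungrouped data from the earlier slides (integer part of nα, plus one). With that rule Q3 = x(8) = 93, which is consistent with IQR = 93 - 75 = 18 and gives an upper limit of 120. The helper name `quantile` is illustrative.

```python
def quantile(values, alpha):
    """Quantile of ungrouped data, following the rule of the earlier slides."""
    x = sorted(values)
    pos = alpha * len(x)
    k = int(pos)
    if pos == k:                        # n * alpha is an integer
        return (x[k - 1] + x[k]) / 2
    return x[k]                         # x(E(n * alpha) + 1)

scores = [5, 39, 75, 79, 85, 90, 91, 93, 93, 98]

q1 = quantile(scores, 0.25)
q3 = quantile(scores, 0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr                  # "minimum" limit L
upper = q3 + 1.5 * iqr                  # "maximum" limit U
outliers = [s for s in scores if s < lower or s > upper]

print(q1, q3, iqr, lower, upper, outliers)
```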
Skewness
❑ The asymmetry (skewness) of the distribution of a variable X is measured using:
➢ the Fisher asymmetry coefficient
➢ the Pearson coefficient
➢ the Yule coefficient
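The three formulas are not legible in this copy. The block below gives the definitions commonly attached to these names, as an assumption rather than the author's exact notation (Mo is the mode, Q1, Q2, Q3 the quartiles, σ the standard deviation). Each coefficient is zero for a symmetric distribution, positive when the distribution is stretched to the right and negative when it is stretched to the left.

```latex
% Fisher asymmetry coefficient (third central moment over sigma cubed)
\gamma_1 = \frac{\mu_3}{\sigma^3},
\qquad \mu_3 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3

% Pearson coefficient
S_P = \frac{\bar{x} - Mo}{\sigma}

% Yule coefficient
S_Y = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1}
    = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}
```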
Kurtosis
❑ The kurtosis of a distribution indicates how flat the distribution is. In other words, it indicates whether the distribution has a sharp peak or not.
❑ The kurtosis is measured using the Fisher coefficient.
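The definition itself is not legible in this copy. A common form of the Fisher kurtosis coefficient, given here as an assumption, subtracts 3 so that the normal distribution scores zero (some texts omit the subtraction):

```latex
% Fisher kurtosis coefficient (fourth central moment over sigma to the fourth)
\gamma_2 = \frac{\mu_4}{\sigma^4} - 3,
\qquad \mu_4 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^4
```

Under this convention, a positive value indicates a distribution more peaked than the normal distribution and a negative value a flatter one.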
Means and Variances of two Groups
❑ We assume that the n observations are divided into two groups: Group GA and Group GB

              Group A   Group B
Sample size   nA        nB
Mean          x̄A        x̄B
Variance      VA(X)     VB(X)

❑ The total or aggregated mean is a weighted average of the two group means: x̄ = (nA x̄A + nB x̄B)/(nA + nB)
❑ The aggregated variance combines the variances within the two groups and the variance between the two group means.
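The aggregated formulas can be written out as follows. This is a sketch assuming variances computed with division by the group size (the descriptive convention), with n = nA + nB; under that assumption the total variance is the within-group variance plus the between-group variance.

```latex
% Aggregated mean: weighted average of the group means
\bar{x} = \frac{n_A \bar{x}_A + n_B \bar{x}_B}{n_A + n_B}

% Aggregated variance: within-group part + between-group part
V(X) = \underbrace{\frac{n_A V_A(X) + n_B V_B(X)}{n}}_{\text{within groups}}
     + \underbrace{\frac{n_A (\bar{x}_A - \bar{x})^2 + n_B (\bar{x}_B - \bar{x})^2}{n}}_{\text{between groups}}
```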
Means and Variances of two Groups: Example
❑ The students in a class are divided into two groups A and B. The mean age of the 12 students in group A is 23 years, while the mean age of the 15 students in group B is 25 years. Compute the mean age of all the students in this class.

              Group A   Group B
Sample size   12        15
Mean          23        25
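A minimal Python sketch of the computation, using the weighted average of the two group means with the group sizes as weights (variable names are illustrative):

```python
n_a, mean_a = 12, 23    # group A: size and mean age
n_b, mean_b = 15, 25    # group B: size and mean age

# Aggregated mean: weighted average of the two group means
overall_mean = (n_a * mean_a + n_b * mean_b) / (n_a + n_b)
print(round(overall_mean, 2))  # 24.11
```

The result, 651/27 ≈ 24.11 years, lies closer to 25 than to 23 because group B is the larger group.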
Transformation of variables
❑ A change of a variable X is a transformation of this variable into a new variable Y by multiplying X by a constant and/or adding a constant to the variable X:

yi = a xi + b
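The slide stops at the definition. Two standard consequences, stated here as general properties rather than quoted from the slides, are that the mean of Y equals a × (mean of X) + b and the variance of Y equals a² × (variance of X). The Python sketch below checks them numerically; the constants a = 2 and b = 10 are hypothetical, and the data are the five values of the earlier mean example.

```python
from statistics import mean, pvariance

x = [2, 4, 5, 1, 3]       # values reused from the mean example
a, b = 2, 10              # hypothetical constants of the transformation

y = [a * xi + b for xi in x]          # y_i = a * x_i + b

print(mean(x), mean(y))               # mean(Y) = a * mean(X) + b
print(pvariance(x), pvariance(y))     # V(Y) = a**2 * V(X), dividing by n
```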
