Exploratory Data Analysis
TABULAR AND GRAPHICAL METHODS
Outline
Graphical Methods for Quantitative Variables
Graphical Methods for Qualitative Variables
Graphical Methods for Quantitative Variables
CASE 1: ADVERTISING AND SALES
Case 1: Advertising and Sales
Overview
• FOOD4U is a major food and beverage company.
• It sells a number of different products across different markets.
• It uses advertising heavily to promote its products.
Challenge
• The company invests heavily in advertising across different media.
• However, the company is not sure of the utility of advertising.
Objective
• To determine the effectiveness of advertising for the product across several markets.
Advertising Data Set
The Advertising data set consists of the sales (in
thousands of units) of a particular product in 200
different markets.
It also contains the advertising budgets (in thousands
of dollars) for the product in each of the markets for
three different media: TV, Radio, and Newspaper.
Advertising Data Set
Market ID    TV       Radio    Newspaper    Sales
1            230.1    37.8     69.2         22.1
2            44.5     39.3     45.1         10.4
3            17.2     45.9     69.3         9.3
4            151.5    41.3     58.5         18.5
5            180.8    10.8     58.4         12.9
6            8.7      48.9     75           7.2
7            57.5     32.8     23.5         11.8
8            120.2    19.6     11.6         13.2
9            8.6      2.1      1            4.8
Important Questions for an Effective Market Plan
1. Is there any relationship between advertising budget and sales?
2. How strong is the relationship between advertising budget and sales?
3. Which media contribute to sales?
4. How accurately can we predict future sales?
Graphical Methods for Quantitative Variables
Univariate Analysis
Bivariate Analysis
Univariate Analysis
DISTRIBUTION OF SALES
Summarizing Quantitative Data
• A frequency distribution for quantitative data groups data into
intervals called classes and records the number of observations
that fall into each class.
• Guidelines when constructing frequency distribution:
✓ Classes are mutually exclusive.
✓ Classes are exhaustive.
Summarizing Quantitative Data
• The number of classes usually ranges from 5 to 20. This is a guideline, not an
absolute rule.
• Approximating the class width:
  Class width ≈ (Largest value − Smallest value) / Number of classes
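As a rough illustration of the class-width approximation, the sketch below applies it to the nine Sales values listed in the Advertising Data Set table. The function name `approx_class_width`, the choice of six classes, and the step of rounding up to a convenient number are assumptions made for this example, not part of the slides.

```python
import math

# Sales (thousands of units) for the nine markets listed in the
# Advertising Data Set table; the full data set has 200 markets.
sales = [22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8]


def approx_class_width(values, num_classes):
    """Approximate class width: (largest - smallest) / number of classes."""
    return (max(values) - min(values)) / num_classes


# Six classes is an assumed choice within the 5-to-20 guideline.
raw_width = approx_class_width(sales, num_classes=6)

# Rounding up to a convenient number is an assumed convention here.
rounded_width = math.ceil(raw_width)

print(f"raw width = {raw_width:.2f}, rounded up = {rounded_width}")
```

With only nine values the result is narrower than it would be for the full set of 200 markets, so this is a check on the arithmetic rather than a recommended width.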
Frequency Table
Class Interval    Frequency
0-5               3
5-10              42
10-15             80
15-20             44
20-25             25
25-30             6
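A frequency distribution like this one can be sketched in a few lines of code. The example below counts the nine Sales values listed in the Advertising Data Set table into classes of width 5. The helper `frequency_distribution` is hypothetical, and it assumes each class includes its lower limit and excludes its upper limit, so that the classes stay mutually exclusive; the slides do not say which convention was used.

```python
# Sales (thousands of units) for the nine markets listed in the
# Advertising Data Set table; the full data set has 200 markets.
sales = [22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8]


def frequency_distribution(values, lower, width, num_classes):
    """Count values per class, assuming classes of the form [low, high)."""
    counts = [0] * num_classes
    for v in values:
        idx = int((v - lower) // width)
        if not 0 <= idx < num_classes:
            # Classes must be exhaustive: every value needs a class.
            raise ValueError(f"{v} falls outside the classes")
        counts[idx] += 1
    return counts


counts = frequency_distribution(sales, lower=0, width=5, num_classes=6)

for i, count in enumerate(counts):
    low, high = i * 5, (i + 1) * 5
    print(f"{low}-{high}: {count}")
```

The `ValueError` enforces the exhaustive guideline, and the half-open intervals enforce the mutually exclusive one.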
Histogram for Sales
[Figure: histogram of Sales; x-axis: Sales, y-axis: Frequency]
Summarizing Quantitative Data
• A histogram is a visual representation of a frequency or a relative frequency
distribution.
✓ Bar height represents the respective class frequency (or relative
frequency).
✓ Bar width represents the class width.
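To make the link between a frequency table and a histogram concrete, here is a minimal sketch that converts the Sales frequency table into relative frequencies and prints a text histogram. The scaling of one `#` per two markets is an arbitrary choice for display.

```python
# Class intervals and frequencies from the Sales frequency table.
classes = ["0-5", "5-10", "10-15", "15-20", "20-25", "25-30"]
frequencies = [3, 42, 80, 44, 25, 6]

total = sum(frequencies)

# Relative frequency = class frequency / total number of observations.
relative = [f / total for f in frequencies]

for label, f, r in zip(classes, frequencies, relative):
    # Bar length stands in for bar height; roughly one '#' per two markets.
    bar = "#" * (f // 2)
    print(f"{label:>6} | {bar} {f} ({r:.1%})")
```

Because every class has the same width, plotting frequencies or relative frequencies gives bars with the same relative heights.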
Summarizing Quantitative Data
• Shape of Distribution: typically symmetric or skewed
✓ Symmetric - mirror image on both sides of its center.
Symmetric Distribution
Summarizing Quantitative Data
• Skewed distribution
✓ Positively skewed - data form a long,
narrow tail to the right.
✓ Negatively skewed - data form a long,
narrow tail to the left.
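The direction of skew can also be checked numerically. The sketch below uses a moment-based skewness coefficient, which is one common measure and is swapped in here for illustration; the slides themselves assess shape visually. All three data lists are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: third central moment / (variance ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical samples chosen to show each shape.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 12]      # long tail to the right
left_tailed = [-v for v in right_tailed]      # mirror image: tail to the left

for name, data in [("symmetric", symmetric),
                   ("right-tailed", right_tailed),
                   ("left-tailed", left_tailed)]:
    print(f"{name}: skewness = {skewness(data):+.2f}")
```

A positive value corresponds to a positively skewed distribution, a negative value to a negatively skewed one, and a value near zero to a roughly symmetric one.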
An Example of Skewed Distribution
[Figure: 2019 IPL Player Salary Distribution]
Bivariate Analysis
RELATIONSHIP BETWEEN TWO QUANTITATIVE VARIABLES
Scatter Plot
• Linear relationship: upward- or downward-sloping trend of the data.
• Positive linear relationship (shown here): as
x increases, so does y.
• Negative linear relationship: as x increases, y
decreases.
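A scatter plot shows the direction of a linear relationship visually; the Pearson correlation coefficient is one way to put a number on it. It is not covered on these slides and is used here only as an illustrative sketch. The data are the TV budgets and Sales for the nine markets listed in the Advertising Data Set table, and `pearson_r` is a hypothetical helper name.

```python
import math

# TV budget (thousands of dollars) and Sales (thousands of units) for the
# nine markets listed in the Advertising Data Set table.
tv = [230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5, 120.2, 8.6]
sales = [22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8, 13.2, 4.8]


def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled to [-1, 1]."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(tv, sales)
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.2f} ({direction} linear relationship)")
```

The coefficient only captures linear association, so a plot is still needed to spot a curvilinear pattern.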
Scatter Plot
• Curvilinear relationship
✓ As x increases, y increases at an increasing
(or decreasing) rate.
✓ As x increases, y decreases at an
increasing (or decreasing) rate.
Scatter Plot
• No relationship: data are randomly
scattered with no discernible pattern.
• In this scatterplot, there is no apparent
relationship between x and y.
[Figure: Plot of Advertising Data Set]
Graphical Methods for Qualitative Variables
CASE 2: FINDING RIGHT CUSTOMERS FOR ROYAL CREDIT CARD
Case 2: Finding Right Customers at Easy Credit
Overview
• Royal Bank is a well-known name in India.
• It provides easy access to its credit cards.
Challenge
• A number of Royal credit card holders are students.
Objective
• Royal Bank wants to re-examine the strategy of issuing credit cards to students.
Default Data Set
A data set on ten thousand customers.
Variables
• Default: A factor with levels “No” and “Yes” indicating whether the customer
defaulted on their debt.
• Student: A factor with levels “No” and “Yes” indicating whether the customer is a
student.
• Balance: The average balance that the customer has remaining on their credit card
after making their monthly payment.
• Income: Income of customer.
Default Data Set
ID    Balance    Income      Student    Default
1     729.52     44361.63    No         No
2     817.18     12106.13    Yes        No
3     1073.55    31767.14    No         No
4     529.25     35704.49    No         No
5     785.66     38463.5     No         No
6     919.59     7491.559    Yes        No
7     825.5      24905.23    No         No
8     808.67     17600.45    Yes        No
9     1161.06    37468.53    No         No
Objective
Inference
• Relationship between the output (i.e., Default) and the input variables
(i.e., Balance, Income, Student).
Prediction
• Whether an individual will default on his or her credit card payment.
Graphical Methods for Qualitative Variables
Univariate Analysis
Bivariate Analysis
Univariate Analysis
DISTRIBUTION OF DEFAULT
Default Status
Default Status    Number of Customers
No                9667
Yes               333
Total             10000
[Figure: bar chart of Number of Customers by Default Status]
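The counts for Default Status can be turned into relative frequencies, which is what a pie chart or a relative frequency bar chart would display. A minimal sketch using the counts from the table:

```python
# Number of customers by Default Status, from the table.
counts = {"No": 9667, "Yes": 333}

total = sum(counts.values())

# Relative frequency = category count / total number of customers.
relative = {status: n / total for status, n in counts.items()}

for status, share in relative.items():
    print(f"Default = {status}: {counts[status]} customers ({share:.2%})")
```

The two shares sum to 100%, and the small share of defaulters explains why the "Yes" bar is barely visible next to the "No" bar on a count scale.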
Bivariate Analysis
RELATIONSHIP BETWEEN TWO QUALITATIVE VARIABLES
Student Status and Default Status
                  Default
Student Status    No      Yes    Total
No                6850    206    7056
Yes               2817    127    2944
Total             9667    333    10000
[Figure: multiple bar chart of Number of Customers by Student Status, with separate bars for Default = No and Default = Yes]
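A two-way table like this is built by counting each combination of the two qualitative variables. The sketch below does that for the nine customers listed in the Default Data Set table, so the counts are far smaller than in the full table of ten thousand customers. The variable names are illustrative.

```python
from collections import Counter

# (Student, Default) pairs for the nine customers listed in the
# Default Data Set table, in ID order.
records = [
    ("No", "No"), ("Yes", "No"), ("No", "No"),
    ("No", "No"), ("No", "No"), ("Yes", "No"),
    ("No", "No"), ("Yes", "No"), ("No", "No"),
]

levels = ["No", "Yes"]

# Each cell of the table counts one (Student, Default) combination;
# a Counter returns 0 for combinations that never occur.
table = Counter(records)

# Marginal totals: sum across a row or down a column.
row_totals = {s: sum(table[(s, d)] for d in levels) for s in levels}
col_totals = {d: sum(table[(s, d)] for s in levels) for d in levels}

print("Student  Default=No  Default=Yes  Total")
for s in levels:
    print(f"{s:<8} {table[(s, 'No')]:<11} {table[(s, 'Yes')]:<12} {row_totals[s]}")
print(f"{'Total':<8} {col_totals['No']:<11} {col_totals['Yes']:<12} {len(records)}")
```

The same cell counts are what a multiple bar chart plots, one group of bars per student status.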
Student Status and Default Status
Student Status    Default Rate
No                2.92%
Yes               4.31%
[Figure: bar chart of Default Rate (%) by Student Status]
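The default rates follow directly from the two-way table of Student Status and Default Status: within each student group, divide the number of defaulters by the group total. A short sketch of that arithmetic, using the counts from the table:

```python
# Counts from the Student Status by Default Status table:
# student status -> (number with Default = No, number with Default = Yes)
counts = {
    "No": (6850, 206),
    "Yes": (2817, 127),
}

# Default rate within a group = defaulters / all customers in the group.
rates = {}
for student, (no_default, yes_default) in counts.items():
    rates[student] = yes_default / (no_default + yes_default)

for student, rate in rates.items():
    print(f"Student = {student}: default rate = {rate:.2%}")
```

Comparing rates rather than raw counts matters here because the two groups differ greatly in size (7056 non-students versus 2944 students).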
Tabular and Graphical Methods
SUMMARY
Summary
                       Interval Data      Nominal Data
Univariate Analysis    Histogram          Frequency and Relative Frequency Tables, Bar and Pie Charts
Bivariate Analysis     Scatter Diagram    Multiple Bar Charts
Reading Material
• Chapter 2 of Jaggia and Kelly
✓ Introductory Case: House Prices in Southern California
✓ Sections 2.1, 2.2, 2.4
• Exercises: 7, 8, 20, 49, 56, Case Study 2.2