
Discriminant Analysis in Marketing Analytics

The document discusses the application of discriminant analysis in marketing analytics for classification and prediction, particularly in distinguishing between different groups based on their characteristics. It outlines the methodology, including the construction of a discriminant function, the importance of independent variables, and the statistical significance of the model. A case study is presented where a bank uses discriminant analysis to classify credit card applicants as low or high risk based on their age, income, and years of marriage.

Uploaded by

disha mahesh

MARKETING ANALYTICS

Brijesh Singh
Department of Management Studies

Discriminant Analysis for
Classification and Prediction

Application Areas

1. The major application area for this technique is where we want to be able to distinguish between two or three sets of objects or people, based on the knowledge of some of their characteristics.

2. Examples include the selection process for a job, the admission process of an educational programme in a college, or dividing a group of people into potential buyers and non-buyers.

3. Discriminant analysis can be, and in fact is, used by credit rating agencies to rate individuals and classify them into good lending risks or bad lending risks. The detailed example discussed later shows how to do that.

4. To summarise, we can use linear discriminant analysis when we have to classify objects into two or more groups based on the knowledge of some variables (characteristics) related to them. Typically, these groups would be users and non-users, potentially successful and potentially unsuccessful salespeople, high-risk and low-risk consumers, or groups on similar lines.
Methods, Data etc.

1. Discriminant analysis is very similar to the multiple regression technique. The form of the equation in a two-variable discriminant analysis is:
Y = a + k1 x1 + k2 x2

2. This is called the discriminant function. As in a regression analysis, Y is the dependent variable and x1 and x2 are independent variables. k1 and k2 are the coefficients of the independent variables, and a is a constant. In practice, there may be any number of x variables.

3. Please note that Y in this case is a categorical variable (unlike in regression analysis, where it is continuous). x1 and x2 are, however, continuous (metric) variables. k1 and k2 are determined by appropriate algorithms in the computer package used, but the underlying objective is that these two coefficients should maximise the separation or differences between the two groups of the Y variable.

4. Y will have 2 possible values in a 2 group discriminant analysis, 3 values in a 3 group discriminant analysis, and so on.

5. k1 and k2 are also called the unstandardised discriminant function coefficients.

6. As mentioned above, Y is a classification into 2 or more groups and is therefore a ‘grouping’ variable, in the terminology of discriminant analysis. That is, groups are formed on the basis of existing data, and coded as 1 and 2.

7. The independent (x) variables are continuous scale variables, and are used as predictors of the group to which the objects will belong. Therefore, to be able to use discriminant analysis, we need to have some data on Y and the x variables from experience and / or past records.
Building a Model for Prediction/Classification

Assuming we have data on both the y and x variables of interest, we estimate the coefficients of the model, which is a linear equation of the form shown earlier, and use the coefficients to calculate the y value (discriminant score) for any new data points that we want to classify into one of the groups. A decision rule is formulated for this process to determine the cut off score, which is usually the midpoint of the mean discriminant scores of the two groups (appropriate when the two groups are of equal size).

Accuracy of Classification:
Then, the existing data points are classified using the equation, and the accuracy of the model is determined. This output is given by the classification matrix (also called the confusion matrix), which tells us what percentage of the existing data points is correctly classified by this model.
Stepwise / Fixed Model:

Just as in regression, we have the option of entering one variable at a time (stepwise) into the discriminant equation, or entering all the variables which we plan to use at once. Depending on the correlations between the independent variables, and the objective of the study (exploratory or predictive / confirmatory), the choice is left to the student.
Relative Importance of Independent Variables

1. Suppose we have two independent variables, x1 and x2. How do we know which one is more important in discriminating between groups?

2. The coefficients of x1 and x2 are the ones which provide the answer, but not the raw (unstandardised) coefficients. To overcome the problem of different measurement units, we must obtain standardised discriminant coefficients. These are available from the computer output.

3. The higher the absolute value of the standardised discriminant coefficient of a variable, the higher its discriminating power.
A Priori Probability of Classification into Groups

The discriminant analysis algorithm requires us to assign an a priori (before analysis) probability of a given case belonging to one of the groups. There are two ways of doing this.
• We can assign an equal probability of assignment to all groups. Thus, in a 2 group discriminant analysis, we can assign 0.5 as the probability of a case being assigned to either group.
• We can formulate any other rule for the assignment of probabilities. For example, the probabilities could be proportional to the group size in the sample data. If two thirds of the sample is in one group, the a priori probability of a case being in that group would be 0.67 (two thirds).
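
The two ways of assigning a priori probabilities can be sketched in a few lines of Python. This is only an illustration: the function name `prior_probabilities` and the 12/6 sample split are assumptions made for the example and do not come from any statistical package.

```python
from collections import Counter

def prior_probabilities(labels, equal=False):
    """Return a dict mapping each group label to its a priori probability."""
    counts = Counter(labels)
    if equal:
        # Option 1: every group gets the same prior.
        return {group: 1.0 / len(counts) for group in counts}
    # Option 2: priors proportional to the group sizes in the sample.
    n = len(labels)
    return {group: count / n for group, count in counts.items()}

# Hypothetical sample: 12 cases in group 1 and 6 cases in group 2.
sample = [1] * 12 + [2] * 6
equal_priors = prior_probabilities(sample, equal=True)
proportional_priors = prior_probabilities(sample)
print(equal_priors)
print(proportional_priors)
```

With two thirds of the sample in group 1, the proportional rule gives that group a prior of about 0.67, as described above.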
Some Key Statistics
Steps Involved in Conducting Discriminant Analysis
Conducting Discriminant Analysis in SPSS
Case Study

We will turn now to a complete worked example which will clarify many of the concepts explained earlier. We will begin with the problem statement and input data.

Case
Suppose State Bank of Bhubaneswar (SBB) wants to start a credit card division. They want to use discriminant analysis to set up a system to screen applicants and classify them as either ‘low risk’ or ‘high risk’ (risk of default on credit card bill payments), based on information collected from their applications for a credit card.

Suppose SBB has managed to get from SBI, its sister bank, some data on SBI’s credit card holders who turned out to be ‘low risk’ (no default) and ‘high risk’ (defaulting on payments) customers. These data on 18 customers are given in fig. 1.
Fig. 1: Input data (RISKLOHI: 1 = low risk, 2 = high risk)

Case  RISKLOHI  AGE  INCOME  YRSMARID
1     1         35   40000   8
2     1         33   45000   6
3     1         29   36000   5
4     2         22   32000   0
5     2         26   30000   1
6     1         28   35000   6
7     2         30   31000   7
8     2         23   27000   2
9     1         32   48000   6
10    2         24   12000   4
11    2         26   15000   3
12    1         38   25000   7
13    1         40   20000   5
14    2         32   18000   4
15    1         36   24000   3
16    2         31   17000   5
17    2         28   14000   3
18    1         33   18000   6
We will perform a discriminant analysis and advise SBB on how to set up its system to screen potential good customers (low risk) from bad customers (high risk). In particular, we will build a discriminant function (model) and find out
• The percentage of customers that it is able to classify correctly.
• The statistical significance of the discriminant function.
• Which variables (age, income, or years of marriage) are relatively better in discriminating between ‘low’ and ‘high’ risk applicants.
• How to classify a new credit card applicant into one of the two groups, ‘low risk’ or ‘high risk’, by building a decision rule and a cut off score.
Interpretation

Input data are given in fig. 1.

Interpretation of Computer Output:
We will now find answers to all the four questions we have raised earlier.

Q1. How good is the Model? How many of the 18 data points does it classify correctly?

To answer this question, we look at the computer output labelled fig. 3. This is a part of the discriminant analysis output from any computer package such as SPSS, SYSTAT, STATISTICA, SAS etc. There could be minor variations in the exact numbers obtained, and major variations could occur if the options chosen by the student are different. For example, if the a priori probabilities chosen for the classification into the two groups are equal, as we have assumed while generating this output, then you will very likely see similar numbers in your output.

Fig. 3: Classification Matrix
Rows: Observed classifications
Columns: Predicted classifications

Group           Percent Correct  G_1 (Predicted) P=.50000  G_2 (Predicted) P=.50000
G_1 (Observed)  100.0000         9                         0
G_2 (Observed)  88.8889          1                         8
Total           94.4444          10                        8
This output (fig. 3) is called the classification matrix (also known as the confusion matrix), and it indicates that the discriminant function we have obtained is able to classify 94.44 percent of the 18 objects correctly. This figure is in the “percent correct” column of the classification matrix. More specifically, it also says that out of 10 cases predicted to be in group 1, 9 were observed to be in group 1 and 1 in group 2 (from column G_1). Similarly, from column G_2, we understand that out of 8 cases predicted to be in group 2, all 8 were found to be in group 2. Thus, on the whole, only 1 case out of 18 was misclassified by the discriminant model, giving us a classification (or prediction) accuracy level of (18-1)/18, or 94.44 percent.

As mentioned earlier, this level of accuracy may not hold for all future classification of new cases. But it is still a pointer towards the model being a good one, assuming the input data were relevant and scientifically collected. There are ways of checking the validity of the model, but these will be discussed separately.
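
The percentages in the classification matrix can be reproduced from the four counts alone. The sketch below takes the counts from fig. 3; the function names are illustrative and are not part of any package's output.

```python
# Rows are observed groups, columns are predicted groups (counts from fig. 3).
matrix = [
    [9, 0],  # observed group 1: 9 predicted as group 1, 0 as group 2
    [1, 8],  # observed group 2: 1 predicted as group 1, 8 as group 2
]

def percent_correct_by_group(m):
    """Percent of each observed group that lands on the diagonal."""
    return [100.0 * m[i][i] / sum(m[i]) for i in range(len(m))]

def overall_percent_correct(m):
    """Percent of all cases that are classified into their observed group."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return 100.0 * correct / total

per_group = percent_correct_by_group(matrix)
overall = overall_percent_correct(matrix)
print([round(p, 4) for p in per_group])  # [100.0, 88.8889]
print(round(overall, 4))                 # 94.4444
```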
Statistical Significance

Q2. How significant, statistically speaking, is the discriminant function?

This question is answered by looking at the Wilks’ Lambda and the probability value for the F test given in the computer output, as a part of fig. 3 (shown below).

Discriminant Function Analysis Results
Number of variables in the model: 3
Wilks’ Lambda: .3188764 approx. F (3, 14) = 9.968056 p < .00089

Wilks’ Lambda indicates how well the independent variables contribute to separating the groups. The scale ranges from 0 to 1, where 0 means total discrimination and 1 means no discrimination.

The value of Wilks’ Lambda is 0.319. This value is between 0 and 1, and a low value (closer to 0) indicates better discriminating power of the model. Thus, 0.319 is an indicator of the model being good. The probability value of the F test indicates that the discrimination between the two groups is highly significant. This is because p < .00089, which indicates that the F test would be significant at a confidence level of up to (1 - .00089) x 100, or 99.91 percent.
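
As a cross-check, Wilks’ Lambda and the F value can be recovered from the eigenvalue of the discriminant function (2.136012 in the coefficient outputs). The sketch below assumes the standard relationships for a two-group analysis with a single discriminant function: Lambda = 1 / (1 + eigenvalue), and F = ((1 - Lambda) / Lambda) x (df2 / df1) with df1 equal to the number of variables and df2 = n - df1 - 1. The function names are illustrative.

```python
def wilks_lambda_from_eigenvalue(eigenvalue):
    # With a single discriminant function, Lambda = 1 / (1 + eigenvalue).
    return 1.0 / (1.0 + eigenvalue)

def f_statistic_two_groups(wilks, n_cases, n_vars):
    # F test for two groups: df1 = n_vars, df2 = n_cases - n_vars - 1.
    df1 = n_vars
    df2 = n_cases - n_vars - 1
    f_value = ((1.0 - wilks) / wilks) * (df2 / df1)
    return f_value, df1, df2

wilks = wilks_lambda_from_eigenvalue(2.136012)
f_value, df1, df2 = f_statistic_two_groups(wilks, n_cases=18, n_vars=3)
print(round(wilks, 4), round(f_value, 3), df1, df2)
```

Both numbers agree with the output above (Lambda of about .3189 and F (3, 14) of about 9.968), which is a useful sanity check when reading output from different packages.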
Variables Importance

Q3. We have 3 independent (or predictor) variables: Age, Income and No. of Years Married. Which of these is a better predictor of a person being a low credit risk or high credit risk?

To answer this question, we look at the standardised coefficients in the output. These are given in fig. 5 (shown below).

Fig. 5: Standardized Coefficients for Canonical Variables

Variable   Root 1
AGE        .923955
INCOME     .774780
YRSMARID   .151298
Eigenval   2.136012
[Link]     1.000000

This output shows that Age is the best predictor, with a coefficient of 0.92, followed by Income, with a coefficient of 0.77. Years of Marriage is last, with a coefficient of 0.15. Please recall that the absolute value of the standardised coefficient of each variable indicates its relative importance.
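
To see where a standardised coefficient comes from, the sketch below multiplies the raw coefficient of AGE (.24560, from the raw coefficient output in fig. 4) by the pooled within-group standard deviation of AGE computed from the fig. 1 data. It assumes the usual definition, standardised = raw x pooled within-group SD; the helper names are illustrative.

```python
import math

# AGE values from fig. 1, split by RISKLOHI group.
age_group1 = [35, 33, 29, 28, 32, 38, 40, 36, 33]  # RISKLOHI = 1
age_group2 = [22, 26, 30, 23, 24, 26, 32, 31, 28]  # RISKLOHI = 2

def sum_sq_dev(values):
    """Sum of squared deviations from the group mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def pooled_within_sd(*groups):
    """Pooled within-group standard deviation across the given groups."""
    ss = sum(sum_sq_dev(g) for g in groups)
    df = sum(len(g) for g in groups) - len(groups)
    return math.sqrt(ss / df)

raw_age = 0.24560  # raw (unstandardised) coefficient for AGE
std_age = raw_age * pooled_within_sd(age_group1, age_group2)
print(round(std_age, 3))
```

The result is close to the .923955 reported for AGE in fig. 5. The same check for INCOME is less precise because its raw coefficient is printed with only one significant digit (.00008).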
Classification

Q4. How do we classify a new credit card applicant into either a ‘high risk’ or ‘low risk’ category, and make a decision on accepting or refusing him a credit card?

This is the most important question to be answered. Please remember why we started out with the discriminant analysis in this problem. State Bank of Bhubaneswar wished to have a decision model for screening credit card applicants.

The way to do this is to use the outputs in fig. 4 (raw or unstandardised coefficients in the discriminant function) and fig. 6 (means of canonical variables). Fig. 6 gives us the new means for the transformed group centroids.

Fig. 6: Means of Canonical Variables

Group   Root 1
G_1:1   1.37793
G_2:2   -1.37792

Thus, the new mean for group 1 (low risk) is 1.37793, and the new mean for group 2 (high risk) is -1.37792. This means that the midpoint of these two is 0. This is clear when we plot the two means on a straight line, and locate their midpoint, as shown below.

-1.37              0              +1.37
Mean of Group 2    Midpoint       Mean of Group 1
(High Risk)                       (Low Risk)

This also gives us a decision rule for classifying any new case. If the discriminant score of an applicant falls to the right of the midpoint, we classify him as ‘low risk’, and if the discriminant score of an applicant falls to the left of the midpoint, we classify him as ‘high risk’. In this case, the midpoint is 0. Therefore, any positive (greater than 0) value of the discriminant score will lead to classification as ‘low risk’, and any negative (less than 0) value of the discriminant score will lead to classification as ‘high risk’. But how do we compute the discriminant score of an applicant?

We use the applicant’s Age, Income and Years of Marriage (from his application) and plug these into the unstandardised discriminant function. This gives us his discriminant score.
Model

Fig. 4: Raw Coefficients for Canonical Variables

Variable   Root 1
AGE        .24560
INCOME     .00008
YRSMARID   .08465
Constant   -10.00335
Eigenval   2.13601
[Link]     1.00000

From fig. 4 (reproduced above), the unstandardised (or raw) discriminant function is
Y = -10.00335 + Age (.24560) + Income (.00008) + Yrs. Married (.08465)
where Y gives us the discriminant score of any person whose Age, Income and Yrs. Married are known.

Let us take the example of a credit card applicant to SBB who is aged 40, has an income of Rs. 25,000 per month and has been married for 15 years. Plugging these values into the discriminant function above, we find his discriminant score Y to be

-10.00335 + 40 (.24560) + 25000 (.00008) + 15 (.08465)
= -10.00335 + 9.824 + 2 + 1.26975
= 3.0904

According to our decision rule, any discriminant score to the right of the midpoint of 0 leads to a classification in the low risk group. Therefore, we should give this person a credit card, as he is a low risk customer. The same process is to be followed for any new applicant. If his discriminant score is to the left of the midpoint of 0, he should be denied a credit card, as he is a ‘high risk’ customer.
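
The scoring and decision rule can be sketched as a small screening function. The coefficients are the raw coefficients from fig. 4, with the constant taken with the negative sign used in the discriminant function, and the group means are taken with a positive sign for group 1 (low risk), as in the text. The names `discriminant_score` and `classify` are illustrative.

```python
# Raw coefficients from fig. 4; the constant enters with a negative sign.
CONSTANT = -10.00335
COEF = {"age": 0.24560, "income": 0.00008, "yrs_married": 0.08465}

# Group means of the discriminant score: low risk positive, high risk negative.
MEAN_LOW_RISK = 1.37793
MEAN_HIGH_RISK = -1.37792
CUTOFF = (MEAN_LOW_RISK + MEAN_HIGH_RISK) / 2.0  # practically 0

def discriminant_score(age, income, yrs_married):
    """Unstandardised discriminant score for one applicant."""
    return (CONSTANT
            + COEF["age"] * age
            + COEF["income"] * income
            + COEF["yrs_married"] * yrs_married)

def classify(score, cutoff=CUTOFF):
    """Scores to the right of the cut off are low risk, to the left high risk."""
    return "low risk" if score > cutoff else "high risk"

score = discriminant_score(age=40, income=25000, yrs_married=15)
label = classify(score)
print(round(score, 2), label)  # 3.09 low risk
```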

We have completed answering the four questions raised by State Bank of Bhubaneswar.
APPLIED MARKETING RESEARCH
Logistic Regression for
Classification and Prediction

Introduction

• Logistic Regression is used to distinguish between two or more groups.

• Typical application areas are cases where one wishes to predict the likelihood of an entity belonging to one group or another, such as in response to a marketing effort (likelihood of purchase/non-purchase), creditworthiness (high/low risk of default), or insurance (high/low risk of accident claim).

• It is similar to Discriminant Analysis in application.

How it is Different

• Unlike Multiple Linear Regression or Linear Discriminant Analysis, Logistic Regression fits an S-shaped curve to the data.

• This curved relationship ensures that the predicted values are always between 0 and 1.
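
A minimal sketch of the S-shaped (logistic) curve shows why the predicted values stay between 0 and 1 while a straight line does not. The function names are illustrative.

```python
import math

def logistic(z):
    """Map a real-valued score z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # this form avoids overflow for very negative z
    return ez / (1.0 + ez)

def linear(z):
    """A straight line, for comparison: its values are unbounded."""
    return z

for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(z, linear(z), round(logistic(z), 4))
```

The linear column runs from -6 to 6, while the logistic column rises from close to 0, through 0.5 at z = 0, to close to 1.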

Binomial Logistic Regression using SPSS Statistics

• A binomial logistic regression (often referred to simply as logistic regression) predicts the probability that an observation falls into one of two categories of a dichotomous dependent variable based on one or more independent variables that can be either continuous or categorical.

• For example, you could use binomial logistic regression to understand whether exam performance can be predicted based on revision time, test anxiety and lecture attendance (i.e., where the dependent variable is "exam performance", measured on a dichotomous scale, "passed" or "failed", and you have three independent variables: "revision time", "test anxiety" and "lecture attendance"). Alternatively, you could use binomial logistic regression to understand whether drug use can be predicted based on prior criminal convictions, drug use amongst friends, income, age and gender (i.e., where the dependent variable is "drug use", measured on a dichotomous scale, "yes" or "no", and you have five independent variables: "prior criminal convictions", "drug use amongst friends", "income", "age" and "gender").
Assumptions
When you choose to analyse your data using binomial logistic regression, part of the process involves checking to make sure that the data you want to analyse can actually be analysed using a binomial logistic regression. You need to do this because it is only appropriate to use a binomial logistic regression if your data "passes" the assumptions that are required for binomial logistic regression to give you a valid result.

Assumption #1: Your dependent variable should be measured on a dichotomous scale. Examples of dichotomous variables include gender (two groups: "males" and "females"), presence of heart disease (two groups: "yes" and "no"), personality type (two groups: "introversion" or "extroversion"), body composition (two groups: "obese" or "not obese"), and so forth. However, if your dependent variable was measured not on a dichotomous scale but on a continuous scale, you will need to carry out multiple regression instead.

Assumption #2: You have one or more independent variables, which can be either continuous (i.e., an interval or ratio variable) or categorical (i.e., an ordinal or nominal variable). Examples of continuous variables include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth. Examples of nominal variables include gender (e.g., 2 groups: male and female), profession (e.g., 5 groups: surgeon, doctor, nurse, dentist, therapist), and so forth.

Assumption #3: You should have independence of observations and the dependent variable should have mutually exclusive and exhaustive categories.
Example

A health researcher wants to know whether the "incidence of heart disease" can be predicted based on "age", "weight", "gender" and "VO2max" (where VO2max refers to maximal aerobic capacity, an indicator of fitness and health). To this end, the researcher recruited 100 participants to perform a VO2max test and recorded their age, weight and gender. The participants were also evaluated for the presence of heart disease. A binomial logistic regression was then run to determine whether the presence of heart disease could be predicted from their VO2max, age, weight and gender.

Setup in SPSS Statistics

In this example, there are six variables: (1) heart_disease, which is whether the participant has heart disease, "yes" or "no" (the dependent variable); (2) VO2max, which is the maximal aerobic capacity; (3) age, which is the participant's age; (4) weight, which is the participant's weight (technically, it is their 'mass'); (5) gender, which is the participant's gender; and (6) caseno, which is the case number. Variables (2) to (5) are the independent variables.

Test Procedure in SPSS Statistics

The steps below show you how to analyse your data using a binomial logistic regression in SPSS Statistics when none of the assumptions in the previous section, Assumptions, have been violated.

1. Click Analyze > Regression > Binary Logistic... on the main menu. You will be presented with the Logistic Regression dialogue box.

2. Transfer the dependent variable, heart_disease, into the Dependent box, and the independent variables age, weight, gender and VO2max into the Covariates box.

3. Click on the Categorical button. You will be presented with the Logistic Regression: Define Categorical Variables dialogue box. SPSS Statistics requires you to define all the categorical predictor values in the logistic regression model. It does not do this automatically.

4. Transfer the categorical independent variable, gender, from the Covariates box to the Categorical Covariates box.

5. Click on the Continue button. You will be returned to the Logistic Regression dialogue box.

6. Click on the Options button. The options dialogue box will open.

7. Click on the Continue button. You will be returned to the Logistic Regression dialogue box.

8. Click on the OK button. This will generate the output.

How it is done

• To keep the predicted values between 0 and 1, a regression is first performed on a transformed value of Y, called the logit function. The equation (shown below for two independent variables) is:

• Logit(Y) = ln(odds) = a + k1x1 + k2x2

where odds refers to the odds of Y being equal to 1. To understand the difference between odds and probabilities, consider the following example.

Example of Odds and Probability

• When a coin is tossed, the probability of Heads showing up is 0.5, but the odds of belonging
to the group “Heads” are 1.0. Odds are defined as the probability of belonging to one group
divided by the probability of belonging to the other.

• Thus, odds = p/(1-p) and for the coin toss example, odds = 0.5/0.5 = 1.
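
The relationship between probability, odds and the logit can be sketched as follows. The function names are illustrative; the 0.75 case is an extra example not taken from the slides.

```python
import math

def odds_from_probability(p):
    """odds = p / (1 - p), defined for 0 <= p < 1."""
    return p / (1.0 - p)

def probability_from_odds(odds):
    """Inverse of the function above: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

def logit(p):
    """Natural log of the odds, the left-hand side of the logistic model."""
    return math.log(odds_from_probability(p))

print(odds_from_probability(0.5))   # 1.0 (the coin toss)
print(logit(0.5))                   # 0.0
print(odds_from_probability(0.75))  # 3.0
print(probability_from_odds(3.0))   # 0.75
```

Odds of 1 (a logit of 0) correspond to a probability of 0.5, positive logits to probabilities above 0.5, and negative logits to probabilities below 0.5.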

Numerical Example

• To see how Logistic Regression works, and to compare it with Discriminant Analysis, consider the case study described in the Discriminant Analysis chapter on Customer Loyalty at Raymond’s showroom. The data are shown below.

INPUT DATA

FREQ  Average Purchase  YEARS  Loyalty
15    24765             3      0
17    18654             4      0
29    20320             1      0
25    41230             7      1
29    31462             5      1
41    7232              6      0
14    45352             4      0
27    45320             5      1
32    51500             5      1
29    45782             7      1
40    59990             9      1
13    8920              3      0
33    23250             5      1
3     35000             6      0
18    14235             2      0
21    25550             3      0
39    33330             7      1
31    31654             4      1
Independent Variables

The Independent Variables are

• Freq: Frequency of purchase in a year

• Avgpurch: Average purchase by customer in a year

• Years: Number of years the customer has been purchasing from Raymond

Score Computation

• As in a regression, we can compute the score for any observation. Consider the first observation in our data, with values of 15, 24765, and 3 for the three independent variables, respectively. The score for this person is (using the B coefficients from the output table titled Predictors)

• -416.973 + 9.478 (15) + 0.006 (24765) - 5.733 (3) = -133.68. (With the coefficients exactly as printed here, the sum comes to about -143.41; the classification that follows is the same for either value.)

• While building an LR equation involving categorical variables, remember to include the coefficients of the categorical variables in the equation.
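
The score computation can be sketched as below, using the B coefficients exactly as they are quoted in the text. Because those printed values are rounded, the sketch returns about -143.41 rather than the -133.68 quoted above; it is an assumption that the difference is due to rounding. The names `lr_score` and the `B_*` constants are illustrative.

```python
# B coefficients as quoted in the text (constant, Freq, Avgpurch, Years).
B_CONSTANT = -416.973
B_FREQ = 9.478
B_AVGPURCH = 0.006
B_YEARS = -5.733

def lr_score(freq, avgpurch, years):
    """Logistic regression score (log of the odds) for one customer."""
    return (B_CONSTANT
            + B_FREQ * freq
            + B_AVGPURCH * avgpurch
            + B_YEARS * years)

# First observation in the input data: Freq 15, Avgpurch 24765, Years 3.
score_first = lr_score(freq=15, avgpurch=24765, years=3)
print(round(score_first, 2))
```

Either value is a large negative score, so the probability derived from it is practically 0 in both cases.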

Converting Score into Probability

• This score indicates the log of the odds of being disloyal (dependent value of 1).
• To convert this into a probability (p) of being disloyal, we use the transformation p = e^(-133.68) / [1 + e^(-133.68)] ≈ 0.
• Since the probability of disloyalty is practically 0, this person will be classified by the model as loyal (forecasted value of dependent is 0).

Classification of New Customer

• As shown above for an existing customer, the values of the independent variables are used
to compute a score, which is then transformed to get a probability of disloyalty. If this
probability is greater than 0.5, the customer will be classified as disloyal (1). If less than 0.5,
then he/she will be classified as loyal (0).
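
The conversion from score to probability and then to a predicted class can be sketched as follows. The coding (1 = disloyal, 0 = loyal) and the 0.5 cut off follow the text; the function names are illustrative.

```python
import math

def probability_from_score(score):
    """p = e^score / (1 + e^score), written to avoid overflow."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

def classify(score, cut=0.5):
    """Return 1 (disloyal) if the probability exceeds the cut off, else 0 (loyal)."""
    return 1 if probability_from_score(score) > cut else 0

p = probability_from_score(-133.68)
print(p < 1e-50, classify(-133.68))  # True 0
print(classify(2.0))                 # 1
```

A probability above 0.5 is the same as a positive score, so the rule can equally be stated on the score itself.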

Binomial Logistic Regression using SPSS Statistics

The Omnibus Tests of Model Coefficients table is used to check that the new model (with explanatory variables included) is an improvement over the baseline model. It uses chi-square tests to see if there is a significant difference between the log-likelihoods (specifically the -2LLs) of the baseline model and the new model. If the new model has a significantly reduced -2LL compared to the baseline, this suggests that the new model explains more of the variance in the outcome and is an improvement.

To confuse matters, there are three different versions: Step, Block and Model. The Model row always compares the new model to the baseline. The Step and Block rows are only important if you are adding the explanatory variables to the model in a stepwise or hierarchical manner. If we were building the model up in stages, these rows would compare the -2LL of the newest model with that of the previous version to ascertain whether or not each new set of explanatory variables was causing improvements.
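
A sketch of the omnibus test: the chi-square statistic is the drop in -2LL from the baseline to the new model, and its p value is the upper-tail chi-square probability with degrees of freedom equal to the number of added predictors. To stay dependency-free, the sketch uses the closed form of that tail probability, which holds only for even degrees of freedom. The value 27.402 on 4 degrees of freedom is the model chi-square reported for the heart disease example; the function names are illustrative.

```python
import math

def omnibus_chi_square(neg2ll_baseline, neg2ll_model):
    """Model chi-square: the reduction in -2LL achieved by the new model."""
    return neg2ll_baseline - neg2ll_model

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

p_value = chi2_sf_even_df(27.402, df=4)
print(p_value < 0.0005)  # True
```

For odd degrees of freedom, a statistics library would be needed in place of `chi2_sf_even_df`.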

The Hosmer-Lemeshow test examines the null hypothesis that the predictions made by the model fit perfectly with the observed group memberships. A chi-square statistic is computed comparing the observed frequencies with those predicted under the model. A non-significant chi-square indicates that the model fits the data well.

Variance explained

In order to understand how much variation in the dependent variable can be explained by the model (the equivalent of R² in multiple regression), you can consult the "Model Summary" table.

This table contains the Cox & Snell R Square and Nagelkerke R Square values, which are both methods of calculating the explained variation. These values are sometimes referred to as pseudo R² values (and will have lower values than in multiple regression). They are interpreted in the same manner, but with more caution. The explained variation in the dependent variable based on our model therefore ranges from 24.0% to 33.0%, depending on whether you reference the Cox & Snell R² or the Nagelkerke R² method, respectively. Nagelkerke R² is a modification of Cox & Snell R², the latter of which cannot achieve a value of 1. For this reason, it is preferable to report the Nagelkerke R² value.
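
The two pseudo R² measures can be sketched from their standard definitions. Cox & Snell R² depends only on the model chi-square and the sample size, so it can be checked against the reported figures (chi-square of 27.402, and 100 participants, assuming all cases were used). Nagelkerke R² also needs the baseline -2LL, which is not given in the text, so that function is shown without a worked value. The function names are illustrative.

```python
import math

def cox_snell_r2(model_chi_square, n_cases):
    # R2_CS = 1 - exp(-chi_square / n),
    # where chi_square = (-2LL of baseline) - (-2LL of model).
    return 1.0 - math.exp(-model_chi_square / n_cases)

def nagelkerke_r2(model_chi_square, neg2ll_baseline, n_cases):
    # Rescales Cox & Snell by its maximum attainable value,
    # 1 - exp(-(-2LL of baseline) / n), so that the measure can reach 1.
    max_r2 = 1.0 - math.exp(-neg2ll_baseline / n_cases)
    return cox_snell_r2(model_chi_square, n_cases) / max_r2

r2_cs = cox_snell_r2(model_chi_square=27.402, n_cases=100)
print(round(r2_cs, 3))  # 0.24
```

The result matches the 24.0% quoted for Cox & Snell. Because the rescaling factor is always below 1, Nagelkerke R² is always the larger of the two.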
Category prediction
Binomial logistic regression estimates the probability of an event (in this case, having heart disease) occurring. If the estimated probability of the event occurring is greater than or equal to 0.5 (better than even chance), SPSS Statistics classifies the event as occurring (e.g., heart disease being present). If the probability is less than 0.5, SPSS Statistics classifies the event as not occurring (e.g., no heart disease).

It is very common to use binomial logistic regression to predict whether cases can be correctly classified (i.e., predicted) from the independent variables. Therefore, it becomes necessary to have a method to assess the effectiveness of the predicted classification against the actual classification. There are many methods to assess this, with their usefulness often depending on the nature of the study conducted. However, all methods revolve around the observed and predicted classifications, which are presented in the "Classification Table" of the output.

Firstly, notice that the table has a footnote which states, "The cut value is .500". This means that if the probability of a case being classified into the "yes" category is greater than .500, then that particular case is classified into the "yes" category. Otherwise, the case is classified into the "no" category (as mentioned previously). Whilst the classification table appears to be very simple, it actually provides a lot of important information about your binomial logistic regression result, including:

A. The percentage accuracy in classification (PAC), which reflects the percentage of cases that are correctly classified with the independent variables added to the model.
B. Sensitivity, which is the percentage of cases that had the observed characteristic (e.g., "yes" for heart disease) which were correctly predicted by the model (i.e., true positives).
C. Specificity, which is the percentage of cases that did not have the observed characteristic (e.g., "no" for heart disease) and were also correctly predicted as not having the observed characteristic (i.e., true negatives).
D. The positive predictive value, which is the percentage of correctly predicted cases "with" the observed characteristic compared to the total number of cases predicted as having the characteristic.
E. The negative predictive value, which is the percentage of correctly predicted cases "without" the observed characteristic compared to the total number of cases predicted as not having the characteristic.
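
The five measures A to E can be sketched from the four cells of a classification table. The counts used below are hypothetical and chosen only for illustration; they are not the counts from the SPSS output for the heart disease example. The function name is illustrative.

```python
def classification_metrics(tp, fn, fp, tn):
    """Percentages A to E from a 2x2 classification table.

    tp: observed yes, predicted yes    fn: observed yes, predicted no
    fp: observed no,  predicted yes    tn: observed no,  predicted no
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": 100.0 * (tp + tn) / total,                # A
        "sensitivity": 100.0 * tp / (tp + fn),                # B
        "specificity": 100.0 * tn / (tn + fp),                # C
        "positive_predictive_value": 100.0 * tp / (tp + fp),  # D
        "negative_predictive_value": 100.0 * tn / (tn + fn),  # E
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=20, fn=15, fp=10, tn=55)
for name, value in metrics.items():
    print(name, round(value, 1))
```

Sensitivity and specificity are computed along the rows (observed classes), while the two predictive values are computed down the columns (predicted classes).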

Variables in the equation

The "Variables in the Equation" table shows the contribution of each independent variable to the model and its statistical significance.

A logistic regression was performed to ascertain the effects of age, weight, gender and VO2max on the likelihood that participants have heart disease. The logistic regression model was statistically significant, χ²(4) = 27.402, p < .0005. The model explained 33.0% (Nagelkerke R²) of the variance in heart disease and correctly classified 71.0% of cases. Males were 7.02 times more likely to exhibit heart disease than females. Increasing age was associated with an increased likelihood of exhibiting heart disease, but increasing VO2max was associated with a reduction in the likelihood of exhibiting heart disease.
THANK YOU

Brijesh Singh
Department of Management Studies
brijeshsingh@[Link]
