SPSS Data Analysis Training Guide

The SPSS Training Manual provides an overview of SPSS (Statistical Package for the Social Sciences), detailing its functionality for analyzing quantitative data through various statistical techniques. It covers essential aspects such as starting SPSS, defining variables, entering data, data cleaning, handling missing values, and generating descriptive statistics. The manual also includes instructions for creating visual representations of data, such as histograms and scatter plots, and emphasizes the importance of data integrity and error checking in statistical analysis.

STATISTICAL DATA ANALYSIS MANUAL

SPSS
SPSS Training Manual

Introduction

SPSS stands for Statistical Package for the Social Sciences. The product was briefly rebranded as
PASW Statistics (Predictive Analytics SoftWare Statistics) and is now sold as IBM SPSS Statistics.
It is a menu-driven application that is used to analyze quantitative data. The procedures for the
analysis are straightforward, although the user needs sufficient knowledge and understanding of
statistics to decide which analysis is most appropriate for their study.

SPSS provides features for analyzing and displaying information using a variety of techniques.
SPSS will help us produce frequencies, cross tabulations, comparisons, correlations and regression
analyses, test hypotheses and compute many other statistics.

Starting SPSS

Start SPSS from the Start button or from the Desktop (depending on how it was installed on
your computer)

Click the Cancel button and choose the view you want to work from.

The SPSS window has two different views/panels – Data view and Variable view. You can
choose one by clicking on the tab in the bottom left-hand side of the window. The data view is
used for entering or displaying the data, while the variable view is used for specifying or editing
the format and properties of the variables. We normally start a new SPSS project from the
variable view.

The Variable View Window

This is where variable names are entered and defined. Before data can be entered in SPSS, you
need to know what the data represent.

• Any SPSS data set is made up of a number of observations, each of which contains values
for each variable in the dataset.
• Each variable needs to be given a variable name that is used in describing the variable to
SPSS.
Naming and Defining Variables

1. Select the Variable View of the SPSS window by clicking on the variable view tab in the
bottom left hand corner of the screen.
2. There are a number of headed columns, each of them used to indicate some facet of the
definition of each variable.
• Name: This is where variable names are typed. Names should consist of
alphanumeric characters, with no spaces, mathematical symbols or special
characters, and must start with an alphabetic character. The underscore character
is frequently used where a space is desired in the variable name.
• Type: The type column by default shows numeric for all rows. This means that
numeric (number) values will be expected in the dataset relating to these variables.
Other commonly used types include date and string.
• Variable width and decimal places: Width is the maximum number of
characters that will be displayed for a particular variable in all output relating to
this variable. It does not control the display in the Data View window, which is
determined by Columns. For a numeric variable, width needs to be considered
alongside the next column, labeled Decimals. The value in that column indicates the
number of decimal places that will be displayed in all the data relating to this
variable. By default the width is set to 8 and decimals to 2; you may reduce
decimals to zero if there are no decimal places in your data. For a string or date
variable the Decimals column has no meaning.
• Variable label: This is used to inform SPSS about the details associated with each
variable name. The maximum length on any label is 256 characters and there are
no restrictions on what may appear.
• Value labels: This is used to attach descriptive labels to the numeric codes that
represent the categories of a categorical variable. For example, you can enter value
labels for sex as 1 for Male and 2 for Female.
• Missing values: This allows you to define which codes correspond to missing
values. You can define several codes, allowing you to distinguish between data that
are missing because the respondent forgot to answer, because the question was not
applicable, or because the respondent refused to answer. For example, code 97 could
indicate “not applicable”, 98 could indicate “respondent missed the question” and 99
could indicate “respondent was not comfortable answering the question”. If a value
is defined as a missing value code, subjects with that code will be dropped from
the analysis of that variable.
• Data display: The Column and Align columns are concerned with the display of
data in the Data View window. The default value for a column is 8 characters
wide and alignment is Right. These can be altered as the user wishes.
• Measurement scale: This is concerned with the measurement scale properties of
your variables. In statistics, certain procedures are only appropriate for variables
measured on specific scales of measurement. The measure characteristics
recognized by SPSS are as follows:
o Nominal: means that the number simply represents a category of object.
There is no measured difference between the objects or people. Some
examples are assigning a number to gender (male – 1, female – 0) or
marital status (married – 1, single – 2, divorced – 3). You are just assigning
a number to something.
o Ordinal: means that a larger number indicates that the object is
truly larger in some amount, value, importance or hierarchy. This
typically means ranks. Some examples are 1st, 2nd and 3rd places in a
contest. However, there is no exactly measured difference among the
objects. We do not know how much larger or better the 1st is compared to
the 2nd. We just know 1st is somehow larger or better than 2nd.
o Interval: means that there is a rank for the objects or people, and there is also
a measurement for the ranking. Some examples are degrees Celsius or
Fahrenheit. The difference between 98 and 99 degrees is the same amount of
temperature (the same change in the mercury level of a thermometer) as the
difference between 42 and 43 degrees. However, there is no true zero, which
would stand for a complete lack of the quantity being measured: 0 degrees
does not mean that there is no temperature. In SPSS this measure is
categorized as Scale.
o Ratio: means, like interval, that there is a measurement for the ranking,
but there is also a true zero. A true zero means a complete lack of what is
being measured. Some examples are income or age. The difference between
$10,000 and $11,000 is known, and the difference between an 18 year old and a
60 year old is known. Also, zero income means lack of income and zero age
means non-existence of the person. SPSS also categorizes this measure as
Scale.
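
The effect of declaring missing value codes, as described under Missing values above, can be illustrated outside SPSS. The short Python sketch below is only an illustration and not part of SPSS: the ages are made up, and the codes 97, 98 and 99 are taken from the example in the text. It shows how far the mean can be distorted when such codes are treated as real values.

```python
from statistics import mean

# Codes from the example in the text: 97, 98 and 99 mark missing answers.
MISSING_CODES = {97, 98, 99}

def valid_values(values, missing_codes=MISSING_CODES):
    """Return only the values that are not declared as missing codes."""
    return [v for v in values if v not in missing_codes]

# Made-up ages for seven respondents; three of them carry a missing code.
ages = [25, 31, 99, 42, 97, 38, 98]

valid = valid_values(ages)
print("valid cases:", len(valid))
print("mean of valid ages:", mean(valid))
print("mean if codes are treated as ages:", round(mean(ages), 2))
```

With the codes declared as missing, only the four genuine ages contribute to the mean, which is what SPSS does for a variable whose missing value codes have been defined.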

Entering Data/The Data View Window

After defining the variables in the Variable View window, click on the Data View tab. An empty
spreadsheet with the variable names as column headings will be displayed. This is where data is
entered for purposes of analysis. You can also import data from other computer applications, such
as Excel and Access.

SPSS Exercise 1

1. Create an SPSS data entry template for the data in the SPSS Exercise 1.
2. Enter the data into the template and save it

Data Cleaning and Handling Missing Values

Every dataset contains some errors, and every analyst experiences a rite of passage in wasting
days drawing wrong conclusions because the errors have not been first rooted out. Up to half of
the time needed for analysis is typically spent in "cleaning" the data. This time is also, typically,
underestimated. Often, once a clean dataset is achieved, the analysis itself is quite
straightforward.

Steps in data cleaning for purposes of detecting errors

1. The key variables should be examined and corrected. These are variables that are of great
interest to the researcher. For example age, weight and height are key variables in a
nutrition study.
2. Make sure dichotomous variables are coded with 1 = category of interest and 0 =
category not of interest
3. Properly fix variable labels and value labels
4. Make sure variables are of the appropriate type (numerical/string) and variable measures
are of the right level (scale/ordinal/nominal)
5. Examine write-in text for “Other” responses and recode into already existing categories
or create new categories.
6. Clean qualitative answers, but retain the respondent’s voice. Edit the punctuation and
grammar, but don’t change the meaning of the response

7. Check for and delete duplicate data entries. Experience has it that many data entrants
duplicate data in the process of data entry. The analyst should therefore be able to check
for duplicate entries and delete them from the data set.
8. Perform some descriptive statistics to see if the data make sense
• Do the maximum and minimum values fall within the question’s expected range?
• Does the mean make sense for that question?
• How about the frequency tables for categorical data? Aren’t there some “fake”
entries in your dataset?
9. Check for missing data. How many entries were missed for each variable? Is there a
pattern in the missing values? Can they be estimated and entered?
• Make sure missing data is coded properly
• Sort your data based on one of the variables to establish some useful order in your
data and find out whether you can estimate the missing data.
10. Generate some charts
• Histograms for Likert/ordinal/continuous variables
• Scatter plots for write-in continuous variables (e.g. age) to look for outliers and
nonsense values
• Use frequency tables and histograms to examine normality of distributions, the
range and extreme skew and kurtosis
11. Perform descriptive statistics again. Make sure everything makes sense
12. Screen data to suit the requirements of a particular statistical test (depends on the analysis
to be performed)

Initial Data Checking

It is extremely important to check the entered data as carefully as possible to detect
inconsistent data and straightforward errors. An example of an inconsistency is a record with
“Never” in “Ever Smoked” and then “5” in “No of Cigars Smoked in a Day”.
• Go to View→Value labels
• Go to Analyze→Reports→Case summaries
This feature allows you to look at a single column or a number of columns separately from
the rest of the data. Take your time to make sure that the data is “clean”, especially if it
was entered by different people.

Checking for Duplicate Entries

It is extremely rare for two respondents to have identical entries on every variable: same
sex, same height, same weight, same age, same marital status, same religion, same education level,
same home and same opinion over an issue. Such records are almost always the result of the same
questionnaire being entered twice. Duplicate entries should therefore be identified and removed
from the dataset.

• Select Data→Identify Duplicate Cases


• Move the variables you want to test to the area labeled “Define matching case by:”
• You may or may not sort within the matching cases
• Click on Ok.
A new variable labeled Primary Last will be created and appended at the end of the
spreadsheet.
Primary or original cases will be marked 1, while the duplicates will be marked 0.
Check your questionnaires to establish whether these cases are genuine; if they are not, delete
them from the data set.
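
The idea behind the duplicate check can be sketched in a few lines of Python. This is an illustration only, not what SPSS runs internally: the function name, the variable names and the records are hypothetical, and the sketch treats the first case in each matching group as the primary one, whereas SPSS may mark a different case as primary depending on the options chosen in the dialog.

```python
def flag_duplicates(cases, keys):
    """Return one flag per case: 1 for the first case with a given
    combination of key values, 0 for every later repeat."""
    seen = set()
    flags = []
    for case in cases:
        signature = tuple(case[k] for k in keys)
        if signature in seen:
            flags.append(0)      # duplicate of an earlier case
        else:
            seen.add(signature)
            flags.append(1)      # primary (first) case
    return flags

# Made-up records; case 3 repeats case 1 on every matching variable.
cases = [
    {"id": 1, "sex": 1, "age": 34, "height": 170},
    {"id": 2, "sex": 0, "age": 29, "height": 158},
    {"id": 3, "sex": 1, "age": 34, "height": 170},
]

flags = flag_duplicates(cases, ["sex", "age", "height"])
for case, flag in zip(cases, flags):
    print(case["id"], flag)
```

The more variables you include in the matching list, the less likely it is that two genuinely different respondents are flagged as duplicates.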

Generating Descriptive Statistics

Descriptive statistics can assist an analyst to clean the collected data. Generating mean,
minimum, maximum and frequencies will reveal some anomalies in the data set.

• Go to Analyze→Descriptive Statistics→Frequencies
• Move the variable of interest into the Variable box on the right hand side.
• Click on Statistics to select some summary statistics such as range, maximum, minimum
and mean.
• Click on Ok to generate the output.
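
To see why the minimum, maximum and mean are useful for cleaning, consider the following Python sketch. The ages and the expected range of 15 to 100 are assumptions made up for the illustration; in practice the expected range comes from your questionnaire.

```python
from statistics import mean

def describe(values):
    """Summary statistics similar to those requested in the steps above."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": mean(values),
    }

def out_of_range(values, low, high):
    """Return (position, value) pairs that fall outside the expected range."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]

# Made-up ages; 350 is a typing error for 35.
ages = [23, 35, 41, 29, 350, 38]

summary = describe(ages)
suspects = out_of_range(ages, low=15, high=100)
print(summary)
print("suspect entries:", suspects)
```

A maximum far above the expected range, or a mean that does not make sense for the question, points you to the record that needs to be checked against the questionnaire.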

Generating Descriptive Statistics


Descriptive statistics are mostly generated for discrete and continuous variables. The most common ones
include the mean and the sum. You can also generate dispersion statistics in the same window as that
for descriptives, such as maximum, minimum, range and standard deviation. Skewness and kurtosis
can also be generated.

• Select Analyze→descriptive Statistics→Descriptives


• Place the variables of which you want to generate Descriptive Statistics in the Variable
box
• Click on Options
• Check the statistics you want to generate
• Click Continue and finally Ok

Handling Missing Data

Types of Missingness

1. Missing Completely At Random (MCAR)

There is no pattern – the cases with missing data are indistinguishable from those that are
complete. This is the best scenario of missing data. In this case removing cases with missing data
does not bias your inferences.

2. Missing at Random (MAR)

The general assumption for MAR is that the probability that a variable is missing depends only on
available information. That is, the probability of non-response to the question depends only on other
fully recorded variables.

3. Missing Not At Random (MNAR)/Not Missing At Random (NMAR)

Not random, but missingness cannot be predicted by variables in the dataset. When frequency
tables are generated, the missing responses can be identified for each of the variables.

Procedure

Identify the type of missingness (MCAR, MAR or NMAR). You may delete some records with
missing data if data is missing in a key variable. You may sort your data based on one of the
fields and choose to replace missing data using one of the following methods: mean of nearby
points, median of nearby points or linear interpolation.

• Go to Transform→Replace missing values


• Select the Variable with missing values
• Give a name to the new variable that will be created
• Select either Mean of nearby point, Median of nearby point or Linear interpolation
to replace the missing values
• Click on Change
• Click on Ok
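
The replacement methods named above can be sketched in Python to show what they do to a sorted series. This is a simplified illustration under stated assumptions, not the SPSS implementation: missing values are represented by None, the weights are made up, and the exact rules SPSS applies (for example at the ends of the series or when there are too few neighbours) may differ.

```python
def linear_interpolation(series):
    """Fill None gaps along a straight line between the nearest valid
    neighbours. Gaps at the start or end of the series stay None."""
    filled = list(series)
    valid = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(valid, valid[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

def mean_of_nearby_points(series, span=1):
    """Fill each None with the mean of up to `span` valid values on
    either side of it (assumed rule for this sketch)."""
    filled = list(series)
    for i, v in enumerate(series):
        if v is None:
            before = [x for x in series[:i] if x is not None][-span:]
            after = [x for x in series[i + 1:] if x is not None][:span]
            neighbours = before + after
            if neighbours:
                filled[i] = sum(neighbours) / len(neighbours)
    return filled

# Made-up weights, already sorted in a meaningful order, with three gaps.
weights = [50, None, 54, 60, None, None, 66]

print(linear_interpolation(weights))
print(mean_of_nearby_points(weights))
```

Note that the two methods agree for a single gap between two valid values but differ for a run of consecutive gaps: interpolation spreads the values along a line, while the mean of nearby points gives every gap in the run the same value.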

Summary of how missing values are handled in SPSS analysis commands

• Descriptives
For each variable, the number of non-missing values is used. You can specify the
missing=listwise subcommand to exclude cases that have a missing value on any variable
in the list.
• Frequencies
By default, missing values are excluded and percentages are based on the number of
non-missing values. If you use the missing=include subcommand on the frequencies
command, the percentages are based on the total number of non-missing and
user-missing values and the percentage of user-missing values is reported in the table.
• Correlations
By default, correlations are computed based on the number of pairs with non-missing data
(pairwise deletion of missing data). The missing=listwise subcommand can be used on
the correlations command to request that correlations be computed only on observations with
complete valid data for all variables on the var subcommand (listwise deletion of
missing data).
• Regression
If values of any of the variables on the var subcommand are missing, the entire case is
excluded from the analysis (i.e., listwise deletion of missing data). It is possible to
further control the treatment of missing data with the missing subcommand and one of
the following keywords: pairwise, meansubstitution, or include.
• Factor
Cases with missing values are deleted listwise, i.e., observations with missing values on
any of the variables in the analysis are omitted from the analysis.
• ANOVA
Cases with any missing value are excluded from any single complete ANOVA design in
which the missing value is encountered. It is possible to specify more than one ANOVA
design with a single anova command.
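
The difference between listwise and pairwise deletion, which runs through the summary above, is easiest to see by counting how many cases each approach keeps. The Python sketch below uses made-up records with None standing for a missing value; the function names are hypothetical.

```python
def listwise(cases, variables):
    """Keep only the cases with no missing value on ANY listed variable."""
    return [c for c in cases if all(c[v] is not None for v in variables)]

def pairwise_n(cases, var_a, var_b):
    """Count the cases that have valid values on BOTH variables of one pair."""
    return sum(1 for c in cases
               if c[var_a] is not None and c[var_b] is not None)

# Made-up records; None marks a missing value.
cases = [
    {"age": 30, "income": 500, "saving": 50},
    {"age": 41, "income": None, "saving": 80},
    {"age": None, "income": 700, "saving": 90},
    {"age": 25, "income": 450, "saving": None},
    {"age": 52, "income": 900, "saving": 120},
]

complete = listwise(cases, ["age", "income", "saving"])
print("listwise N:", len(complete))
print("pairwise N (age, income):", pairwise_n(cases, "age", "income"))
print("pairwise N (income, saving):", pairwise_n(cases, "income", "saving"))
```

Listwise deletion uses the same, smaller set of cases for every statistic, while pairwise deletion uses more of the data but bases each statistic on a possibly different set of cases.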

Designing Histograms for Categorical Data

A histogram will display the frequencies of the values as bars, and if you check the option “Show
normal distribution of the histogram”, the normal distribution curve will be drawn over the
histogram. This will help you identify outliers, which appear as isolated bars far out in the
tails of the normal curve.

• Select Analyze→Descriptive Statistics→Frequencies


• Select Charts→Histogram
• Check Show normal distribution of the histogram
• Click on Continue
• Click on Ok

Designing Scatter Plots for Scale Data

A scatter plot can be used to graph scale data and identify anomalies in your data. The Y-axis
should display the variable under investigation while the X-axis may display a categorical
variable.

• Select Graphs→Legacy Dialogs→Scatter/Dot


• Select Simple Scatter
• Place the Variable under investigation in the Y-axis area
• Place the categorical data in the X-axis area
• Click on Ok

Retrieving Data Files

Retrieving an SPSS file is similar to opening files in other Windows-based
applications.

• Select File→Open→Data
• Select the required file.
• Click on Open

Reading Data from Spreadsheet Formats (Excel, Lotus 1-2-3)


While in Excel or Lotus 1-2-3, type the names of the variables in the first row. Take note also
of the range of the data within the spreadsheet. SPSS will only be able to read a single worksheet
at a time (not the entire workbook).

• Select File→Open→Data
• Change to Excel or Lotus in the section: Files of type
• Select the file you want to open
• You will be prompted with whether you want to make Variable names from the data in
the first row of the table being imported.
• Select the Worksheet where the data is located and specify the range (if any)
• Click on Continue to import the file into SPSS

Editing and Modifying a Dataset

Having done some preliminary analysis, we may need to change the data. We may need to delete
or insert new variables. You may even need to transform the data into new variables.

Inserting Data

When you realize at a later time that some data records are missing from your dataset, you
need to insert a new case.

• Click on the row(s) where you want to insert a new record


• Click on Data→Insert Case
• A new blank row(s) is created
• Type in the missing data record(s)

Deleting a Case

If for some reason(s) you find that there are some data records (cases) you want to delete from
your table, then go ahead and identify and delete them.

• Select the Cases to delete


• Select Edit→Clear

Inserting a variable
You may need to capture some extra variable(s) which you had not catered for. This may be at
the end of the questionnaire/dataset or in the middle.

• Click on the variable after the position at which you wish the variable to appear
• Select Data→Insert variable
• A blank column is inserted before the selected variable.
• Type in the Variable Name, followed by the data.

Deleting a variable

You may no longer be interested in one of the variables in the dataset. To delete a variable:

• While in variable view, highlight the column of the variable you want to delete.
• Select Edit→Clear

Exercise 2

1. Open an Excel data file called SPSS Exercise 2.
2. Use it for the next set of procedures.
3. Below are the variable properties:
ID: Identification Number
Location: 1 – Central, 2 – Nakawa, 3 – Rubaga, 4 – Makindye, 5 – Kawempe
Sex: 0 – Female, 1 – Male
Educ_LV: 0 – Informal, 1 – Formal
Occupn: 0 – Formally Employed, 1 – Self Employed
MStatus: 0 – Married, 1 – Others
Distance: Distance from home to work (kms)
Age: Age of Household Head
FSize: Family Size
Income: Income of Household Head
Saving: Monthly Household Savings

Generating Frequency Tables

• Go to Analyze→Descriptive Statistics→Frequencies
• Move the variable of interest into the Variable box on the right hand side.
• Click on Ok to generate the output.

Generating Cross-tabulations
To examine the relationship between two categorical variables, a two way Frequency Table can
be used. This is called a Cross-tabulation. Suppose we want to know how Education Level is
related to Sex. We examine this by cross tabulating Education Level and sex.

• Select Analyze→descriptive Statistics→Crosstabs


• Place Education Level in the Row(s) box to make it the row variable
• Place Sex in the Column(s) box to make it the column variable
• Click Ok
Two way frequency tables are more informative if they include percentages
• Select Cells from the crosstabs screen, to add percentages
• Check rows, columns and totals in the percentage section
• Click Continue and finally Ok
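
What the cross-tabulation and its row percentages contain can be sketched in Python as follows. The six records are made up; the variable names Educ_LV and Sex and their codes follow the variable properties listed for SPSS Exercise 2.

```python
from collections import Counter

def crosstab(cases, row_var, col_var):
    """Count the cases in each (row value, column value) cell."""
    return Counter((c[row_var], c[col_var]) for c in cases)

def row_percentages(table):
    """Express each cell count as a percentage of its row total."""
    row_totals = Counter()
    for (row, _col), count in table.items():
        row_totals[row] += count
    return {cell: 100 * count / row_totals[cell[0]]
            for cell, count in table.items()}

# Made-up records. Educ_LV: 0 = Informal, 1 = Formal; Sex: 0 = Female, 1 = Male.
cases = [
    {"Educ_LV": 0, "Sex": 0},
    {"Educ_LV": 0, "Sex": 1},
    {"Educ_LV": 1, "Sex": 1},
    {"Educ_LV": 1, "Sex": 1},
    {"Educ_LV": 1, "Sex": 0},
    {"Educ_LV": 1, "Sex": 1},
]

table = crosstab(cases, "Educ_LV", "Sex")
percent = row_percentages(table)
for cell in sorted(table):
    print(cell, table[cell], percent[cell])
```

Row percentages answer the question "of the respondents at this education level, what share is male or female?", which is usually easier to compare across rows than the raw counts.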

Generating 3 Way Cross-tabulations

You may need to compare three variables at once to help you visualize your data even
better. A 3 way (layered) cross tabulation, in which a third variable is used as a layer, will help you.

• Select Analyze→descriptive Statistics→Crosstabs


• Place Education Level in the Row(s) box to make it the row variable
• Place Sex in the Column(s) box to make it the column variable
• Place another variable say, Location in the box labeled Layer 1 of 1
• Click the Ok button
Two way frequency tables are more informative if they include percentages
• Select Cells from the crosstabs screen, to add percentages
• Check rows, columns and totals in the percentage section
• Click Continue and finally Ok

Generating Descriptive Statistics


Descriptive statistics are mostly generated for discrete and continuous variables. The most common ones
include the mean and the sum. You can also generate dispersion statistics in the same window as that
for descriptives, such as maximum, minimum, range and standard deviation. Skewness and kurtosis
can also be generated.

• Select Analyze→descriptive Statistics→Descriptives


• Place the variables of which you want to generate Descriptive Statistics in the Variable
box
• Click on Options
• Check the statistics you want to generate
• Click Continue and finally Ok

Multiple Responses

There are data collection situations in which several responses or measurements are recorded for
a single question. For example, you may ask respondents to name their favorite soft drinks or their
best performing members of parliament. When measuring customer satisfaction, you might ask those
surveyed to tell you what they like and what they don’t like about your product or service.

A second way in which multiple response data can be collected is through a checklist type format.
For surveys this means the respondent is presented with a list and is asked to check those that
apply.

Capturing Multiple Responses


After collecting the data we need to peruse through all the responses collected and code them.
We either create one variable for that response if each respondent was to give only one response,
or we create several variables if each respondent was to give several responses.

Example of question                              No. of variables   Example of variable names
Name two Members of Parliament you consider      2                  Best_MP_One, Best_MP_Two
the best performing
List four of your favorite soft drinks           4                  Drink_1, Drink_2, Drink_3, Drink_4
Do you use any of these means of transport to    4                  Taxi, Bus, Boda boda, Personal car
and from work? 1. Taxi, 2. Bus, 3. Boda boda,
4. Personal car

Analyzing Multiple Responses


• Select Analyze→Multiple Response→Define Variable Sets
• Place the multiple response variables into the Variables in Set area
• Specify the range of the responses (e.g. 1 – 4), or select Dichotomous to count only one
of the responses and specify the value to count, e.g. 1
• Click on Add to create a new multiple response set containing these variables.
• Select (again) Analyze→Multiple Response
• Select Frequencies or Crosstabs to analyze your multiple responses
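
The logic of a multiple response frequency table is to pool the answers from all the variables in the set and count them together. A minimal Python sketch is shown below; the drink names and records are invented for the illustration, and the variable names follow the Drink_1, Drink_2 pattern from the table above.

```python
from collections import Counter

def multiple_response_counts(cases, variables):
    """Pool the answers from several variables and count each answer once
    per variable in which it appears. None means no answer was given."""
    counts = Counter()
    for case in cases:
        for var in variables:
            answer = case.get(var)
            if answer is not None:
                counts[answer] += 1
    return counts

# Made-up answers from three respondents, two variables per respondent.
cases = [
    {"Drink_1": "Cola", "Drink_2": "Orange"},
    {"Drink_1": "Orange", "Drink_2": None},
    {"Drink_1": "Lemon", "Drink_2": "Cola"},
]

counts = multiple_response_counts(cases, ["Drink_1", "Drink_2"])
total_responses = sum(counts.values())
for answer, n in counts.most_common():
    pct_of_responses = 100 * n / total_responses
    pct_of_cases = 100 * n / len(cases)
    print(answer, n, round(pct_of_responses, 1), round(pct_of_cases, 1))
```

Percentages can be based on the number of responses or on the number of cases. The percentages of cases can add up to more than 100 because each respondent may give several answers.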

SPSS Exercise 3

A). Create an SPSS file and enter the data below:

1. Interview Date: Today’s date
2. Respondent ID
3. Sex
4. Date of birth
5. Level of Education
6. Which means of transport do you use to go to work?
• Boda boda/Taxi/Personal car/Bus
7. Name 3 members of parliament that you view as articulate and are measuring up to
the task?

ID  Interview Date  Sex  Date of birth  Level of Education  Transport         MPs
1   Today’s Date    1    30/07/1977     4                   Taxi, Boda boda   Hon. John, Hon. Betty, Hon. Moses
2   Today’s Date    0    6/7/1999       3                   Boda boda, Bus    Hon. Jane, Hon. Betty, Hon. Cissy
3   Today’s Date    1    8/12/2000      2                   Boda boda         Hon. Betty, Hon. Moses, Hon. Ken
4   Today’s Date    0    5/3/1986       1                   Taxi, Bus         Hon. Mary, Hon. Sam, Hon. Ken
5   Today’s Date    1    6/9/1990       4                   Personal Car      Hon. Betty, Hon. Cissy, Hon. Ken

Questions

1. Which means of transport is most used?
2. Who is the most known MP?
3. What is the most used means of transport for each sex?
4. Who is the most popular MP amongst the Male and Female respondents?

Transformation of Data

Data transformation leads to the construction of new variables. This is done based on a number
of requirements: you may want to extract Age from Date of Birth, you may want to generate
NSSF contribution from Gross Salary paid to employees, you may want to combine the codes
for Strongly Agree and Agree into a single Generally Agree code, you may want to transform
skewed data so that it better meets the assumptions of parametric tests, etc.

Computing a New Variable

Let us use the SPSS Exercise 2 data to compute a new variable called D_Income (Disposable
Income), defined as Income minus Saving.

• Select Transform→Compute
• Enter the name D_Income in the Target Variable window
• Click on Type & Label to define the Variable Type and Variable Label
To build up a mathematical expression which will create a new variable, you can choose
variables from the left hand box, then click the Right Arrow button. The expression will
appear in the Numeric Expression window.
• In our example, Click on Income from the left hand side, then – (minus) from the
calculator pad, then Saving from the left hand side.
• Finally, Click the Ok button.
A new variable D_Income will now appear in your dataset. It can now be used or analyzed
like any other variable in the dataset.

Computing Duration of Time Difference by Built-in Functions

There are some variables like date of birth, date of interview or date of assessment which are
stored in date format. But one is able to calculate the time difference (in days) by using the
function [Link]. The age of the respondents on the date of interview (SPSS Exercise 3)
can be calculated from the date of birth and the Interview date.

• Go to Transform→Compute
• In the Target Variable window, type Age_of_respondent
• Select the functions group Time Duration Extraction followed by [Link] in the
Functions and Special Variables window using the Up and Down arrows.
• Click on the Functions Up Arrow key, this will put the function with a “?” in the
parentheses in the box named Numeric Expression.
• Now select the Variable to replace the “?” i.e. Interview Date by clicking the Right
Arrow key.
• Perform the same procedure for Date of Birth.
• You can now compute the difference in time (in days). You can get the difference in years
by dividing the expression by 365 (number of days in a year) to get the age of the respondents in
years.
The expression should look like: ([Link](Interview Date)-[Link](date of
birth))/365
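
The arithmetic behind this expression (difference in days, divided by the number of days in a year) can be checked with a short Python sketch. The date of birth is taken from respondent 1 in SPSS Exercise 3; the interview date is a hypothetical value chosen for the illustration, and the function name is an assumption.

```python
from datetime import date

def age_in_years(date_of_birth, interview_date, days_per_year=365):
    """Age as the number of days between the two dates divided by the
    assumed number of days in a year."""
    days = (interview_date - date_of_birth).days
    return days / days_per_year

date_of_birth = date(1977, 7, 30)    # respondent 1 in SPSS Exercise 3
interview_date = date(2020, 7, 30)   # hypothetical interview date

age = age_in_years(date_of_birth, interview_date)
age_leap = age_in_years(date_of_birth, interview_date, days_per_year=365.25)
print(round(age, 2), round(age_leap, 2))
```

Dividing by 365 slightly overstates the age because it ignores leap years; dividing by 365.25 gives a closer approximation when the result needs to be accurate to a fraction of a year.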

Recoding a Variable

To assist in data analyses we often need to group a continuous variable into categories. For
example age may be recoded into age groups, income into income groups or brackets, etc.

Let us recode the variable Age_of_head for the city survey dataset.

• Select Transform→Recode
• Two options are available: Into Same Variables and Into Different Variables. The first
option leads to potentially valuable information being overwritten. It is usually best to
use the second option, which leaves the original variable intact, since you might still need
it in its original format.
• Select Into Different Variables
• Choose an Input Variable from the list on the left hand side (age_of_head in our case),
then press the Right Arrow button.
• Specify the Name (e.g Age_Group) and Label of the output variable
• Select Old and New Values
Suppose we wish to recode into these groups: <20, 20 – 29, 30 – 39, 40-49, 50-59 and 60
and above.
• Click on Range Lowest Through and enter 19 in the box, then click on value under New
Value and type 1 and finally press Add.
• Click on Range then type 20 and 29. Then on New Value and enter 2 and finally press
Add.
• Click on Range then type 30 and 39. Then on New Value and enter 3 and finally press
Add.
• Click on Range then type 40 and 49. Then on New Value and enter 4 and finally press
Add.
• Click on Range then type 50 and 59. Then on New Value and enter 5 and finally press
Add.
• Finally click on Range Through Highest and enter 60, then click on New Value and
enter 6 and then press Add.
• Once you have specified all the Old→New Recodes, click on Continue, then Ok on the
Recode into Different Variables screen. A new variable is created with entries ranging
from 1 to 6.
• After recoding a variable, it is advisable to run case summaries to compare the old and
the new values.
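
The recoding rules specified above amount to a simple lookup from age to group code. The Python sketch below mirrors those rules under the assumption that ages are recorded in whole years; the function name is hypothetical. Printing the old and new values side by side plays the same role as the case summaries check.

```python
def recode_age(age):
    """Map an age in whole years to the group codes used in the text:
    1: below 20, 2: 20-29, 3: 30-39, 4: 40-49, 5: 50-59, 6: 60 and above."""
    if age is None:
        return None          # a missing age stays missing
    if age <= 19:
        return 1
    if age <= 29:
        return 2
    if age <= 39:
        return 3
    if age <= 49:
        return 4
    if age <= 59:
        return 5
    return 6

# Made-up ages, including values on the group boundaries.
ages = [17, 19, 20, 29, 30, 45, 59, 60, 83, None]
for old, new in zip(ages, [recode_age(a) for a in ages]):
    print(old, new)
```

Checking values that sit exactly on the boundaries (19, 20, 59, 60) is the quickest way to confirm that no range was mistyped.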

SPSS Exercise 4

1. Open an SPSS file called SPSS Exercise 4.
2. Recode the variables from question_B1 to Question_E14 from the current Likert
scale to a dichotomous one where Strongly Agree and Agree become 1 and Disagree and
Strongly Disagree become 0.

Recoding a Variable into a Dichotomous Variable

When you have a variable that takes values from a Likert scale with several responses,
like the one below:

1 – Strongly Disagree
2 – Moderately Disagree
3 – Somewhat Disagree
4 – Undecided
5 – Somewhat Agree
6 – Moderately Agree
7 – Strongly Agree

the analyst may need to combine the responses Strongly Disagree, Moderately
Disagree and Somewhat Disagree into one category, Disagree, and to combine Somewhat Agree,
Moderately Agree and Strongly Agree into one category, Agree, possibly discarding all the
responses showing Undecided. This leads to a new dichotomous variable.

The Likert scale could also comprise 5 points as shown below.

1 – Strongly Agree
2 – Agree
3 – Not Sure
4 – Disagree
5 – Strongly Disagree

• Select Transform→Recode into Different Variables


• Move the variable to recode into the Numeric Variable →Output Variable box
• Type the name of the new variable in the Name box and give it a label in the Label box
• Click on Old and New Values
• Select Range and type in 1 and through 2 for Strongly Agree and Agree
• Under the area New Value type 1 in the Value box and click on Add. The code 1 stands
for Agree
• Select Range, again, and type in 4 and through 5 for Disagree to Strongly Disagree
• Under the area New Value type 0 in the Value box and click on Add. The code 0 stands
for Disagree
• Click on Continue
• Click on the Change button, then click Ok
• A new dichotomous variable is created. It has only 0 and 1 codes. Note that response
3 (Not Sure) has been left out of the recode, so those cases are system-missing in the new variable.
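
The same recode can be written as a small lookup table. The Python sketch below follows the 5-point scale and the codes used in the steps above; the names are hypothetical, and None stands for a value that is left missing in the new variable.

```python
# Old code -> new code, following the steps above:
# 1 (Strongly Agree) and 2 (Agree) become 1 (Agree);
# 4 (Disagree) and 5 (Strongly Disagree) become 0 (Disagree).
RECODE_MAP = {1: 1, 2: 1, 4: 0, 5: 0}

def dichotomize(response):
    """Return 1 for agreement, 0 for disagreement, and None for any value
    not listed in the recode, such as 3 (Not Sure)."""
    return RECODE_MAP.get(response)

# Made-up responses to one Likert item.
responses = [1, 2, 3, 4, 5, 2, 3]
recoded = [dichotomize(r) for r in responses]
print(recoded)
```

Because the Not Sure responses end up missing, any analysis of the new variable is based on fewer cases than the original item, which is worth reporting alongside the results.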

Merging Data in SPSS

SPSS allows you to merge two different data files to make one. This can be done by either adding
cases or adding variables depending on the situation at hand.

• Select Data→Merge Files→Add cases or Add Variables


• Select Browse to go to the location where the file is
• Select the file and click on Open
• Click on Continue. You will be informed of the variables which have been included and
those that have been excluded.
• Click on Ok to complete the process.
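
The two kinds of merge differ in direction: adding cases stacks the rows of two files that share the same variables, while adding variables places new columns next to existing rows that share a key. The Python sketch below illustrates both with made-up records; the function names and the use of ID as the key variable are assumptions for the illustration.

```python
def add_cases(file_a, file_b):
    """Stack the records of two files that hold the same variables."""
    return file_a + file_b

def add_variables(file_a, file_b, key):
    """Attach the variables of file_b to the matching records of file_a,
    matching on the key variable. Unmatched records are left unchanged."""
    lookup = {row[key]: row for row in file_b}
    merged = []
    for row in file_a:
        combined = dict(row)
        combined.update(lookup.get(row[key], {}))
        merged.append(combined)
    return merged

# Add cases: two data entrants captured different respondents.
entrant_1 = [{"ID": 1, "Sex": 1}, {"ID": 2, "Sex": 0}]
entrant_2 = [{"ID": 3, "Sex": 1}]
stacked = add_cases(entrant_1, entrant_2)

# Add variables: a second file holds income for the same respondents.
incomes = [{"ID": 2, "Income": 700}, {"ID": 1, "Income": 500}]
widened = add_variables(entrant_1, incomes, key="ID")

print(stacked)
print(widened)
```

When adding variables, a unique identifier such as the respondent ID is what guarantees that each new value lands on the right record, even if the two files are sorted differently.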

Exercise 5

1. A customer survey was conducted and data was entered by three data entrants. The three
data sets are named Customer Survery Data_1, Customer Survery Data_2 and
Customer Survery Data_3.
2. Merge the three data sets and save the merged data set as Customer Survey Data.
Examining Assumptions of Parametric Statistics

Parametric tests have two main assumptions:

• That data is approximately normally distributed


• That the variances are homogeneous among groups

Test for Normality

Before you conduct any parametric tests you need to check that data values come from an
“approximately normal” distribution. To do this, you can compare the frequency distribution of
your data values with those of a normalized version of these values. If the data are approximately
normal, the distributions should be similar.

Let us objectively determine whether the distributions of the data in SPSS Exercise 1 (savings)
vary significantly from a normal distribution by conducting a normality test. This test will
provide you with a statistic that determines whether your data are significantly different from
normal. The null hypothesis is that the distribution of your data is NOT different from a normal
distribution. The alternative hypothesis is that the distribution of your data is different from a
normal distribution. We reject the null hypothesis if the P-Value is less than 0.05 (meaning that
data this far from normal would turn up less than 5% of the time if the null hypothesis were true).

Exercise

1. Open the SPSS Exercise 2.


2. Test the following variables for normality: Distance, Age, FSize, Income and Saving.

Below is the procedure for Testing Normality for Distance

1. State the hypotheses


Null Hypothesis: The distribution of the data is NOT significantly different from a
normal distribution
Alternative Hypothesis: The distribution of the data is significantly different from a
normal distribution
2. Run the Analysis
• Select Analyze→Nonparametric Tests
• Select 1 Sample K-S
• Put the response variable, Distance, into the variable box on the right
• Tick Normal under Test Distribution
• Click the Ok button
Outputs
One-Sample Kolmogorov-Smirnov Test

                                          DISTANCE
N                                         120
Normal Parameters(a)     Mean             9.85
                         Std. Deviation   5.225
Most Extreme             Absolute         .097
Differences              Positive         .097
                         Negative         -.094
Kolmogorov-Smirnov Z                      1.061
Asymp. Sig. (2-tailed)                    .210
a. Test distribution is Normal.
Observations
a) The output is a Kolmogorov-Smirnov (K-S) table for the data. Your p-value is
the last line of the table: “Asymp. Sig. (2-tailed)”. SPSS has already computed this p-value
for a 2-tailed test, so compare it directly with 0.05; do not halve the threshold.
b) If p<0.05, we reject the null hypothesis and conclude that the distribution of the data is
significantly different from a normal distribution. (Note: Always look at the p-value.
The footnote “Test distribution is Normal” only names the distribution the data were
compared with; it is not the result of the test.)
c) If p>0.05, we fail to reject the null hypothesis and conclude that the distribution of the
data is NOT significantly different from a normal distribution.
d) For the distance data, the p-value is 0.210, which is >0.05; we therefore conclude that the
distance data set is NOT significantly different from a normal distribution.
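To see where the last two rows of the K-S table come from, the sketch below recomputes them from the Distance output. It assumes the usual large-sample relationships (Z is the largest absolute difference times the square root of N, and the p-value comes from the Kolmogorov series); the function names are made up for illustration, and SPSS may use a slightly different approximation for small Z.

```python
import math


def ks_z(d_max, n):
    """Kolmogorov-Smirnov Z: largest absolute difference scaled by sqrt(N)."""
    return d_max * math.sqrt(n)


def ks_asymptotic_p(z, terms=100):
    """Two-tailed asymptotic p-value from the Kolmogorov series.

    Intended for Z values that are not close to zero.
    """
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * z * z)
                for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2 * total))


# Values read from the Distance table: Absolute = .097, N = 120, Z = 1.061
z = ks_z(0.097, 120)
p = ks_asymptotic_p(1.061)
print(round(z, 3), round(p, 3))
```

The recomputed Z differs from the table in the third decimal only because the table rounds the Absolute difference to .097.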

Test for Normality of the FSize data


When you run the normality test for the FSize dataset you get the output shown below.
One-Sample Kolmogorov-Smirnov Test

                                          FSIZE
N                                         120
Normal Parameters(a)     Mean             6.08
                         Std. Deviation   3.760
Most Extreme             Absolute         .150
Differences              Positive         .150
                         Negative         -.115
Kolmogorov-Smirnov Z                      1.639
Asymp. Sig. (2-tailed)                    .009
a. Test distribution is Normal.

Observations
• Look straight at the p-value (0.009), which is less than 0.05. This calls for rejection of the
null hypothesis, so we conclude that the data are significantly different from a normal
distribution.

SPSS Exercise 5
1. Test the following data sets for normality and make the appropriate interpretations and
conclusions: Age, Income and Savings.

Transformation of Data

If your data do not pass the test for normality, you may have to transform them so that they do. If
the transformed data pass the test for normality, you may proceed by running the appropriate
test on the transformed data. If after a number of attempts the transformed data still do not meet
the assumptions of parametric statistics, you must run non-parametric tests.

To Transform the Data:

• Select Transform→Compute
• In the Target Variable box, you name the new transformed variable (for example
Log_FSize).
• In the Function Group box on the right, highlight Arithmetic by clicking on it once.
Various functions will show up in the Functions and Special Variables box. Choose the
LG10 function and double-click on it.
• In the Numeric Expression box, it will show LG10(?). Double click on the name of the
variable you want to transform (e.g. FSize) in the box on the lower left to make FSize
replace “?”.
• Click Ok. SPSS will create a new column in your data sheet that has the log values of the
FSize data.
• After transforming the data, redo the tests of normality to see if the transformed data
now meet the assumptions of parametric statistics, before choosing the test to use on your
data. If the transformed data still do not meet the assumption, you can do a non-
parametric test on the original data
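What the LG10 step does to each value can be sketched in a few lines. The family sizes below are invented, and treating zero or negative values as missing is an assumption made for this sketch, since a base-10 logarithm is undefined for them.

```python
import math


def lg10_or_missing(value):
    """Base-10 log of a positive value; None where the log is undefined."""
    return math.log10(value) if value > 0 else None


fsize = [1, 2, 4, 10, 0]  # hypothetical family sizes, including one zero
log_fsize = [lg10_or_missing(x) for x in fsize]
print(log_fsize)
```

A log transformation pulls in large values more than small ones, which is why it is usually tried on data with a long right tail.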

Observations…
The transformed data produces the following output table:

One-Sample Kolmogorov-Smirnov Test

                                          Log_FSize
N                                         120
Normal Parameters(a)     Mean             .7122
                         Std. Deviation   .25681
Most Extreme             Absolute         .126
Differences              Positive         .083
                         Negative         -.126
Kolmogorov-Smirnov Z                      1.376
Asymp. Sig. (2-tailed)                    .045
a. Test distribution is Normal.
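Notice that the p-value has risen from 0.009 to 0.045, so the log transformation has brought the
FSize data much closer to a normal distribution. Compared with the 0.05 level that applies to
“Asymp. Sig. (2-tailed)”, however, 0.045 is still just below the cut-off, so the Log_FSize data
remain borderline. Try a further transformation, or use a non-parametric test on the original data.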

Test for Homogeneity

Another assumption of parametric tests is that the groups you are comparing have relatively
similar variances. Most of the comparative tests in SPSS will do this
test for you as part of the analysis. For example, when you run a t-test, the output includes
Levene's Test for Equality of Variances, and its “Sig.” column tells you whether or not your data
meet this assumption of parametric statistics.

If the variances are not homogeneous, then you must either transform your data to see if you can
equalize the variances, or you can use a non-parametric test that does not require this assumption.

Comparing Means among Groups

Comparing Means among Groups: Two Sample and Paired t-test

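This compares means between two groups, such as the Distance from home to work for the two
sexes.

Hypotheses

SPSS first runs Levene's test on the variances and then the t-test on the means, so there are two
pairs of hypotheses.

For Levene's test:

Null Hypothesis: The variances of the two groups are NOT significantly different from each
other

Alternative Hypothesis: The variances of the two groups are significantly different from each
other

For the t-test:

Null Hypothesis: The means of the two groups are NOT significantly different from each other

Alternative Hypothesis: The means of the two groups are significantly different from each other

To run a two-sample t-test on the data: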

• First be sure the data is un-split


• Select Analyze→Compare Means
• Select Independent Samples T-Test
• Put the Distance variable in the Test Variable box and Sex in the Grouping Variable
box
• Click on the Define Groups button and put the Names of the Groups in each box
• Click Continue and then Ok.

Outputs

Group Statistics

            SEX      N    Mean    Std. Deviation   Std. Error Mean
DISTANCE    Female   46   6.52    3.758            .554
            Male     74   11.92   4.948            .575

Independent Samples Test (DISTANCE)

Levene's Test for Equality of Variances: F = 2.624, Sig. = .108

t-test for Equality of Means
                                                  Sig.         Mean         Std. Error   95% Confidence Interval of the Difference
                              t        df         (2-tailed)   Difference   Difference   Lower        Upper
Equal variances assumed       -6.344   118        .000         -5.397       .851         -7.082       -3.712
Equal variances not assumed   -6.758   113.210    .000         -5.397       .799         -6.979       -3.815

Observations

• The first table shows the means and standard deviations of the two groups.
• The second table shows the results of the Levene’s test for equality of variances, the t-
value of the test, the degrees of freedom of the test and the p value which is labeled “Sig.”
• The variances of the two groups are NOT significantly different from each other (Levene's
test, p=0.108, which is >0.05)
• We can therefore use parametric tests on this data set, as it has passed both the test
for normality and the test for homogeneity.

Let us look at the results of the t-test.

Observations

• Now that the variances of the two groups are NOT significantly different from each other
(p>0.05), we can focus on the “Equal variances assumed” row of the t-test.
• Remember, the Null Hypothesis: The means of the two groups are NOT significantly
different from each other.
• For the t-test, p=0.000 (reported as p<0.001, which is less than 0.05), so we conclude that
the two means are significantly different from each other. The mean difference (Female
minus Male) equals -5.397.
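As a check on the two rows of the Independent Samples Test, the t-values and degrees of freedom can be recomputed from the Group Statistics table alone. The sketch below uses the standard pooled-variance and Welch formulas; the function names are made up for illustration, and the results differ from the SPSS figures in the third decimal because the table means are rounded.

```python
import math


def pooled_t(m1, s1, n1, m2, s2, n2):
    """t and df for the 'Equal variances assumed' row (pooled variance)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


def welch_t(m1, s1, n1, m2, s2, n2):
    """t and approximate df for the 'Equal variances not assumed' row."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Female: mean 6.52, SD 3.758, N 46; Male: mean 11.92, SD 4.948, N 74
t_eq, df_eq = pooled_t(6.52, 3.758, 46, 11.92, 4.948, 74)
t_w, df_w = welch_t(6.52, 3.758, 46, 11.92, 4.948, 74)
print(round(t_eq, 2), df_eq)
print(round(t_w, 2), round(df_w, 1))
```

The second row does not pool the variances, which is why its degrees of freedom are no longer a whole number.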

Paired t-test

You should analyze your data with a paired t-test only if the samples were paired during data
collection. The test checks whether the mean difference between the samples in each pair is zero.

The Null hypothesis: The mean difference is NOT different from zero

The Alternative hypothesis: The mean difference is different from zero.

Decision rule: Reject the Null hypothesis if the P-value <0.05.

Example:

You may have done a study in which you investigated the effect of light intensity on the growth
of a certain plant. You took cuttings from source plants and for each source plant, you grew 1
cutting in a high-light environment and 1 cutting in a low-light environment. The other
conditions were kept constant between the groups. You measured growth by counting the
number of leaves over the course of the experiment.

Below is the data:

Plant        1   2   3   4   5   6   7   8   9   10
Low-light    2   4   1   3   2   5   4   1   3   4
High-light   3   6   2   4   5   6   5   2   5   5

Procedure;

• Enter the data in columns named Plant, Low and High. Each row in the spreadsheet
should hold one pair of data
• In the Variable View, leave the Measure column on Scale. The Values area should be left
blank.
• Select Analyze→Compare Means
• Select Paired-Samples T Test
• Highlight both of the variables and hit the Arrow button to put them in the Paired
Variables box. They will show up as “Low-High”
• Click the Ok button

Outputs

Paired Samples Statistics

                Mean   N    Std. Deviation   Std. Error Mean
Pair 1   Low    2.90   10   1.370            .433
         High   4.30   10   1.494            .473

Paired Samples Test

                      Paired Differences
                                                                   95% Confidence Interval
                                                                   of the Difference
                      Mean     Std. Deviation   Std. Error Mean    Lower     Upper     t        df   Sig. (2-tailed)
Pair 1   Low - High   -1.400   .699             .221               -1.900    -.900     -6.332   9    .000
Observations
• The first table shows the summary statistics for the 2 groups
• The second table of the SPSS output shows the paired samples correlation
• The third table, the Paired Samples Test table, is the one of interest. It shows the mean
difference, your t value, df and the p-value (labeled as Sig. (2-tailed)). In this table, the
p-value reads 0.000, which means that it is very low. We express this when reporting as
<0.001. We therefore reject the null hypothesis.
• The mean difference is therefore different from zero. We can now report that plants in the
high-light environment added significantly more leaves than their counterpart plants in the
low-light environment. Note that the mean for the high-light environment is 4.30, while that
for the low-light environment is 2.90.
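The figures in the Paired Samples Test table can be reproduced directly from the ten pairs of leaf counts given above. The sketch below uses only the Python standard library and the usual paired t-test formula (mean of the differences divided by its standard error); it is a check on the arithmetic, not a replacement for the SPSS procedure.

```python
import math
from statistics import mean, stdev

# Leaf counts for the ten source plants, taken from the data table above
low = [2, 4, 1, 3, 2, 5, 4, 1, 3, 4]
high = [3, 6, 2, 4, 5, 6, 5, 2, 5, 5]

# One difference per pair, in the same Low - High order that SPSS reports
diffs = [l - h for l, h in zip(low, high)]

mean_diff = mean(diffs)                    # mean of the paired differences
sd_diff = stdev(diffs)                     # sample standard deviation
se_diff = sd_diff / math.sqrt(len(diffs))  # standard error of the mean
t = mean_diff / se_diff
df = len(diffs) - 1

print(round(mean_diff, 3), round(sd_diff, 3), round(se_diff, 3))
print(round(t, 3), df)
```

Because the test works on the differences within each pair, variation between source plants drops out of the comparison.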

Let us run a test for homogeneity on the Income variable.

The following is the output after the test:


Group Statistics

          SEX      N    Mean     Std. Deviation   Std. Error Mean
INCOME    Female   46   5.42E5   185712.501       27381.814
          Male     74   7.26E5   328880.950       38231.634

Independent Samples Test (INCOME)

Levene's Test for Equality of Variances: F = 13.980, Sig. = .000

t-test for Equality of Means
                                                  Sig.         Mean           Std. Error   95% Confidence Interval of the Difference
                              t        df         (2-tailed)   Difference     Difference   Lower           Upper
Equal variances assumed       -3.465   118        .001         -184105.401    53127.842    -289312.993     -7889...
Equal variances not assumed   -3.915   117.111    .000         -184105.401    47025.755    -277236.520     -9097...

(The Upper values are cut off at the edge of the output.)

Observations

• Ignore the first table
• The second table shows the results of the Levene’s test for equality of variances, the t-
value of the test, the degrees of freedom of the test and the p value which is labeled “Sig.”
• The variances of the two groups are significantly different from each other (p<0.05)
• We do not proceed to interpret the results of the t-test unless there is NO significant
difference in the variances of the two groups.
• We therefore need to transform the data for us to use parametric tests.
Let us look at the output after transformation of the data.

Independent Samples Test (Log_Income)

Levene's Test for Equality of Variances: F = 5.564, Sig. = .020

t-test for Equality of Means
                                                  Sig.         Mean         Std. Error   95% Confidence Interval of the Difference
                              t        df         (2-tailed)   Difference   Difference   Lower        Upper
Equal variances assumed       -3.639   118        .000         -.11113      .03053       -.17159      -.05066
Equal variances not assumed   -3.868   112.739    .000         -.11113      .02873       -.16804      -.05421

Observations

• The variances of the two groups are still significantly different from each other
(p=0.020, which is <0.05). You therefore need to use non-parametric tests on this dataset.

Comparing Two Groups Using Non-Parametric Tests: Mann Whitney U-Test

The Mann-Whitney U Test is used to compare differences between two independent groups
when the dependent variable is either (a) ordinal or (b) interval but not normally distributed. It
therefore doesn’t require normality or homogeneity of variances, but is less powerful than the t-
test. It is less likely to show a significant difference between the groups. So whenever data is
approximately normal, then use the t-test.

To Run a Mann-Whitney U Test


• Go to Analyze→Non-Parametric Tests
• Select 2 Independent Samples. A dialogue box will appear.
• Put the variables in the appropriate boxes, define your groups, and confirm that Mann-
Whitney U Test type is checked.
• Click the Ok button
Output

Ranks

          SEX      N     Mean Rank   Sum of Ranks
INCOME    Female   46    46.05       2118.50
          Male     74    69.48       5141.50
          Total    120

Test Statistics(a)

                          INCOME
Mann-Whitney U            1.038E3
Wilcoxon W                2.118E3
Z                         -3.587
Asymp. Sig. (2-tailed)    .000
a. Grouping Variable: SEX

Observations

• The first table shows the group sizes, mean ranks and rank sums used in the calculation of
the test.
• The second table shows the statistical significance of the test, with the U statistic in the
first row and the p-value labeled “Asymp. Sig. (2-tailed)”. The p-value =0.000, which
means that the Incomes of the two sexes (Female and Male) are significantly different
from each other (p<0.05). The mean income for Females is 541,833 with a median of
488,500, while that for Males is 725,938 with a median of 595,600.
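The link between the Ranks table and the Test Statistics table can be made explicit with a short calculation. The sketch below applies the standard formulas for U and its normal approximation to the rank sums shown above; the function name is illustrative, and no correction for tied ranks is applied, so the Z value may differ from SPSS in the last decimal.

```python
import math


def u_from_rank_sum(rank_sum, n):
    """Mann-Whitney U for one group from its rank sum and group size."""
    return rank_sum - n * (n + 1) / 2


n_f, n_m = 46, 74                      # Female and Male group sizes
u_f = u_from_rank_sum(2118.50, n_f)    # Female rank sum from the Ranks table
u_m = u_from_rank_sum(5141.50, n_m)    # Male rank sum from the Ranks table
u = min(u_f, u_m)                      # the smaller value is the reported U

# Normal approximation for Z (without a tie correction)
mean_u = n_f * n_m / 2
sd_u = math.sqrt(n_f * n_m * (n_f + n_m + 1) / 12)
z = (u - mean_u) / sd_u

print(u, round(z, 3))
```

The two U values always add up to the product of the group sizes, which is a quick check that the rank sums were read correctly.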

Comparison of Three or More Groups using the Parametric Statistics: One Way ANOVA
and Post-Hoc Tests

We shall compare the Age of Household Head from the different locations of our dataset.
The appropriate parametric statistical test for continuous data with one independent variable and
more than 2 groups is the one way analysis of variance (ANOVA). It tests whether there is a
significant difference among the means of the groups, but does not tell us which means are
different from each other. In order to find out which means are significantly different from each
other you have to conduct “POST HOC PAIRED COMPARISONS”. These are called post-hoc
because you conduct the tests after you have completed ANOVA, and they show where the
significant differences lie among the groups. One of the post-hoc tests is the “FISHER PLSD”
(Protected Least Significant Difference) test, which gives you a test of all pair-wise
combinations.

To run ANOVA test:

• Select Analyze→Compare Means


• Select One-Way ANOVA
• In the dialog box put the Age variable in the Dependent List box and Location in the
Factor box.
• Click on the Post Hoc button, tick the LSD check box and then click Continue
• Click the Options button and check 2 boxes: Descriptive and Homogeneity of variance test
• Then click Continue and Ok

Observations
Descriptives
AGE

                                                          95% Confidence Interval for Mean
            N     Mean    Std. Deviation   Std. Error    Lower Bound   Upper Bound   Minimum   Maximum
Central     31    32.52   10.033           1.802         28.84         36.20         18        59
Nakawa      29    39.31   10.744           1.995         35.22         43.40         22        62
Rubaga      14    29.00   6.214            1.661         25.41         32.59         20        40
Makindye    27    34.04   14.818           2.852         28.18         39.90         18        88
Kawempe     19    31.26   9.291            2.131         26.79         35.74         21        60
Total       120   33.89   11.374           1.038         31.84         35.95         18        88

• The first table gives you some basic descriptive statistics for the five locations.

Test of Homogeneity of Variances
AGE

Levene Statistic   df1   df2   Sig.
1.645              4     115   .168

• The second table gives you the results of the Levene test (which examines the assumption
of homogeneity of variances). You must assess this before looking at the results of
ANOVA.
• In this case, your variances are homogeneous (p>0.05), so the data meet one of the
assumptions of the test. Thus, you can proceed to use the results of the ANOVA comparison
of the means.
ANOVA
AGE

                 Sum of Squares   df    Mean Square   F       Sig.
Between Groups   1376.996         4     344.249       2.824   .028
Within Groups    14016.596        115   121.883
Total            15393.592        119

• The third table gives you the results of the ANOVA test, which examined whether there were
any significant differences in mean Age of Household Head among the five locations.
• Look at the p-value in the ANOVA table (“Sig.”). If the p-value is <0.05, then at least one
mean is significantly different from the others. In this case, p=0.028 in the ANOVA table,
so at least one of the mean Ages is significantly different from the others.
• Now that you know that the means are significantly different, you want to find out which
pairs of means are different from each other.
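The F value in the ANOVA table is simply the ratio of the two mean squares, and each mean square is a sum of squares divided by its degrees of freedom. The sketch below redoes that arithmetic with the figures from the table; the function name is made up for illustration.

```python
def anova_f(ss_between, df_between, ss_within, df_within):
    """Return (mean square between, mean square within, F ratio)."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between, ms_within, ms_between / ms_within


# Sums of squares and degrees of freedom from the ANOVA table for AGE
ms_between, ms_within, f_ratio = anova_f(1376.996, 4, 14016.596, 115)
print(round(ms_between, 3), round(ms_within, 3), round(f_ratio, 3))
```

The between-groups degrees of freedom are the number of locations minus one (5 - 1 = 4), and the within-groups degrees of freedom are the number of cases minus the number of locations (120 - 5 = 115).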
Multiple Comparisons
AGE
LSD

(I)         (J)         Mean Difference                         95% Confidence Interval
LOCATION    LOCATION    (I-J)              Std. Error   Sig.    Lower Bound   Upper Bound
Central     Nakawa      -6.794*            2.852        .019    -12.44        -1.14
            Rubaga      3.516              3.555        .325    -3.53         10.56
            Makindye    -1.521             2.906        .602    -7.28         4.24
            Kawempe     1.253              3.217        .698    -5.12         7.62
Nakawa      Central     6.794*             2.852        .019    1.14          12.44
            Rubaga      10.310*            3.593        .005    3.19          17.43
            Makindye    5.273              2.952        .077    -.57          11.12
            Kawempe     8.047*             3.258        .015    1.59          14.50
Rubaga      Central     -3.516             3.555        .325    -10.56        3.53
            Nakawa      -10.310*           3.593        .005    -17.43        -3.19
            Makindye    -5.037             3.636        .169    -12.24        2.17
            Kawempe     -2.263             3.889        .562    -9.97         5.44
Makindye    Central     1.521              2.906        .602    -4.24         7.28
            Nakawa      -5.273             2.952        .077    -11.12        .57
            Rubaga      5.037              3.636        .169    -2.17         12.24
            Kawempe     2.774              3.306        .403    -3.77         9.32
Kawempe     Central     -1.253             3.217        .698    -7.62         5.12
            Nakawa      -8.047*            3.258        .015    -14.50        -1.59
            Rubaga      2.263              3.889        .562    -5.44         9.97
            Makindye    -2.774             3.306        .403    -9.32         3.77

*. The mean difference is significant at the 0.05 level.
• The Post Hoc tests, Fisher LSD, allow you to examine all pair-wise comparisons of means.
The results are listed in the fourth table. To find out which groups are different from each
other, look at the Sig. column for each comparison and identify pairs where the P-Value
<0.05. Thus, Central and Nakawa have a significantly different mean age, Nakawa and
Rubaga also have a significantly different mean age, as do Kawempe and Nakawa.
• The rest of the pairs do NOT have a significantly different mean age from each other (i.e.
Central and Kawempe, Central and Rubaga, Central and Makindye, Nakawa and
Makindye, Kawempe and Makindye, Makindye and Rubaga, Kawempe and Rubaga).

Comparison of Three or More Groups Using Non-Parametric Tests: Kruskal-Wallis Test

The Kruskal-Wallis Test is the nonparametric test equivalent to the one-way ANOVA and an
extension of the Mann-Whitney test to allow the comparison of more than two independent
groups. It is used when we wish to compare three or more sets of scores that come from different
groups.

As the Kruskal-Wallis test does not assume normality in the data and is much less sensitive to
outliers it can be used when these assumptions have been violated and the use of the one-way
ANOVA is inappropriate. In addition, if your data is ordinal then you cannot use a one-way
ANOVA but you can use this test.

To Run a Kruskal-Wallis Test

• Go to Analyze→Non-Parametric Tests
• Select Legacy Dialogs→K Independent Samples
• Put your variables in the appropriate boxes, define your groups, and be sure the
Kruskal-Wallis box is checked.
• Click the Ok button.
Ranks

          EDUC_LV     N     Mean Rank
INCOME    Primary     8     58.38
          Secondary   33    49.39
          Diploma     55    55.44
          Degree      24    88.08
          Total       120

Test Statistics(a,b)

               INCOME
Chi-Square     19.650
df             3
Asymp. Sig.    .000
a. Kruskal Wallis Test
b. Grouping Variable: EDUC_LV

Observations

• The first table shows the group sizes and mean ranks used in the calculations of the test
• The second table shows the statistical results of the test. The test statistic that gets
calculated is the Chi-square value, and it is in the first row of the second table. The p-value
is labeled “Asymp. Sig.”.
• The p-value=0.000, which means that the Incomes of the people from different levels of
education are significantly different from each other (p<0.01).
• We do not know which Education Levels are different from each other. Unlike ANOVA,
a Kruskal-Wallis test does not have an easy way to do post hoc analyses. You can follow
it up with a series of two-group comparisons using Mann-Whitney U tests. In this
case, you would follow up Kruskal-Wallis with six Mann-Whitney tests: Primary versus
Secondary, Primary versus Diploma, Primary versus Degree, Secondary versus Diploma,
Secondary versus Degree and Diploma versus Degree.
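The Chi-Square value and the list of follow-up comparisons can both be derived from the Ranks table. The sketch below uses the standard Kruskal-Wallis formula based on group sizes and mean ranks; the function name is illustrative, and no tie correction is applied, so the result is slightly smaller than the value SPSS prints.

```python
from itertools import combinations

# Group size and mean rank for each education level, from the Ranks table
groups = {
    "Primary": (8, 58.38),
    "Secondary": (33, 49.39),
    "Diploma": (55, 55.44),
    "Degree": (24, 88.08),
}


def kruskal_wallis_h(groups):
    """H statistic from group sizes and mean ranks (no tie correction)."""
    n_total = sum(n for n, _ in groups.values())
    centre = (n_total + 1) / 2  # expected mean rank if the groups do not differ
    spread = sum(n * (mean_rank - centre) ** 2
                 for n, mean_rank in groups.values())
    return 12 / (n_total * (n_total + 1)) * spread


h = kruskal_wallis_h(groups)
df = len(groups) - 1

# Every pair of education levels for the follow-up Mann-Whitney U tests
pairs = list(combinations(groups, 2))

print(round(h, 2), df)
for first, second in pairs:
    print(first, "versus", second)
```

With four groups there are six pairs; the number of pairs grows quickly as groups are added, so the more follow-up tests you run, the more cautious you should be about any single significant result.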
Comparing Two Independent Variables: Two-Way ANOVA

The Two-Way ANOVA compares the mean differences between groups that have been split on
two independent variables (called factors). You need two independent categorical variables and
one continuous dependent variable.

The following are assumed:

• Dependent variable is either interval or ratio (continuous).


• The dependent variable is approximately normally distributed for each combination of
levels of the two independent variables.
• Homogeneity of variances of the groups formed by the different combinations of levels
of the two independent variables.

Example: A researcher was interested in whether an individual's interest in politics was
influenced by their level of education and their gender. They recruited a random sample of
participants to their study and asked them about their interest in politics, which they scored
from 0 - 100 with higher scores meaning a greater interest. The researcher then divided the
participants by gender (Male/Female) and then again by level of education
(School/College/University).

Setting up the data in SPSS

Let us use two columns representing the two independent variables and label them "Gender"
and "Edu_Level". For "Gender", we coded males as "0" and females as "1", and for
"Edu_Level", we coded school as "1", college as "2" and university as "3". The participants’
interest in politics was entered under the variable name, "Int_Politics".

Testing of Assumptions

To determine whether your dependent variable is normally distributed for each combination of
the levels of the two independent variables, run a normality test as described under Test for
Normality, separately for each combination of Gender and Edu_Level. In SPSS,
homogeneity of variances is tested using Levene's Test for Equality of Variances. This is included
in the main procedure for running the two-way ANOVA, so we get to evaluate whether there is
homogeneity of variances at the same time as we get the results from the two-way ANOVA.

1. Click Analyze > General Linear Model > Univariate...


2. You will be presented with the "Univariate" dialogue box:
3. You need to transfer the dependent variable "Int_Politics" into the "Dependent Variable:"
box and transfer both independent variables, "Gender" and "Edu_Level", into the "Fixed
Factor(s):" box. The result is shown below:

4. Click on the Plots button. You will be presented with the "Univariate: Profile Plots"
dialogue box:
Transfer the independent variable "Edu_Level" from the "Factors:" box into the "Horizontal
Axis:" box and transfer the "Gender" variable into the "Separate Lines:" box. You will be
presented with the following screen:

[Tip: Put the independent variable with the greater number of levels in the "Horizontal Axis:"
box.]

5. Click the Add button.

You will see that "Edu_Level*Gender" has been added to the "Plots:" box.

6. Click the Continue button. This will return you to the "Univariate" dialogue box.

7. Click the Post Hoc button. You will be presented with the "Univariate: Post Hoc
Multiple Comparisons for Observed..." dialogue box as shown below:
Transfer "Edu_Level" from the "Factor(s):" box to the "Post Hoc Tests for:" box. This will
make the "Equal Variances Assumed" section become active (lose the "grey sheen") and
present you with some choices for which post-hoc test to use. For this example, we are
going to select "Tukey", which is a good, all-round post-hoc test.

[You only need to transfer independent variables that have more than two levels into the
"Post Hoc Tests for:" box. This is why we do not transfer "Gender".]

You will finish up with the following screen:


Click the Continue button to return to the "Univariate" dialogue box.

8. Click the Options button. This will present you with the "Univariate: Options" dialogue
box as shown below:
9. Transfer "Gender", "Edu_Level" and "Gender*Edu_Level" from the "Factor(s) and
Factor Interactions:" box into the "Display Means for:" box. In the "Display" section, tick
the "Descriptive Statistics" and "Homogeneity tests" options. You will be presented with the
following screen:

Click the Continue button to return to the "Univariate" dialogue box.

10. Click the Ok button to generate the output.


SPSS Output of Two-way ANOVA

SPSS produces many tables in its output from a two-way ANOVA and we are going to start with
the "Descriptives" table as shown below:

This table is very useful as it provides the mean and standard deviation for the groups that have
been split by both independent variables. In addition, the table also provides "Total" rows, which
allows means and standard deviations for groups only split by one independent variable or none
at all to be known.

Levene's Test of Equality of Error Variances

The next table to look at is Levene's Test of Equality of Error Variances as shown below:
From this table we can see that we have homogeneity of variances of the dependent variable
across groups. We know this as the Sig. value is greater than 0.05, which is the level we set for
alpha. If the Sig. value had been less than 0.05 then we would have concluded that the variance
across groups was significantly different (unequal).

Tests of Between-Subjects Effects Table

This table shows the actual results of the two-way ANOVA as shown below:

We are interested in the Gender, Edu_Level and Gender*Edu_Level rows of the table as
highlighted above. These rows inform us of whether we have significant mean differences
between our groups for our two independent variables, Gender and Edu_Level, and for their
interaction, Gender*Edu_Level. We must first look at the Gender*Edu_Level interaction as this
is the most important result we are after. We can see from the Sig. column that we have a
statistically significant interaction at the P = .014 level. You may wish to report the results of
Gender and Edu_Level as well. We can see from the above table that there was no significant
difference in interest in politics between Gender (P = .207) but there were significant differences
between educational levels (P < .0005).

Multiple Comparisons Table

This table shows the Tukey post-hoc test results for the different levels of education as shown
below:
We can see from the above table that there is some repetition of the results but, regardless of
which row we choose to read from, we are interested in the differences between (1) School and
College, (2) School and University, and (3) College and University. From the results we can see
that there is a significant difference between all three different combinations of educational level
(P < .0005).
Plot of the Results

The following plot is not of sufficient quality to present in your reports but provides a good
graphical illustration of your results. In addition, we can get an idea of whether there is an
interaction effect by inspecting whether the lines are parallel or not.
From this plot we can see how our results from the previous table might make sense. Remember
that if the lines are not parallel then there is the possibility of an interaction taking place.

Procedure for Simple Main Effects in SPSS

You can follow up the results of a significant interaction effect by running tests for simple main
effects - that is, the mean difference in interest in politics between genders at each education level.
SPSS does not allow you to do this using the graphical interface you will be familiar with, but
requires you to use syntax. We explain how to do this below:

1. Click File > New > Syntax from the main menu as shown below:
2. You will be presented with the Syntax Editor as shown below:

3. Type text into the syntax editor so that you end up with the following (the colours are
automatically added):
[Depending on the version of SPSS you are using you might have suggestion boxes
appear when you type in SPSS-recognised commands, such as, UNIANOVA. If you are
familiar with using this type of auto-prediction then please feel free to do so, but otherwise
simply ignore the pop-up suggestions and keep typing normally.]

Published with written permission from SPSS Inc, an IBM Company.

Basically, all text you see above that is in CAPITALS, is required by SPSS and does not
change when you enter your own data. Non-capitalised text represents your variables and
will change when you use your own data. Breaking it all down, we have:

UNIANOVA                            Tells SPSS to use the Univariate Anova command

Int_Politics BY Gender Edu_Level    Your dependent variable BY your two independent
                                    variables (with a space between them)

/EMMEANS                            Tells SPSS to calculate estimated marginal means

TABLES(Gender*Edu_Level)            Generate statistics for the interaction term. Put your two
                                    independent variables here, separated by a * to denote an
                                    interaction

COMPARE(Gender)                     Tells SPSS to compare the interaction term between
                                    genders
4. Making sure that the cursor is at the end of row 2 in the syntax editor, click the Run
Selection button (the green triangle on the toolbar), which will run the syntax you have typed.
Your results should appear in the Output Viewer below the results you have already generated.

SPSS Output of Simple Main Effects

The table you are interested in is the Univariate Tests table:

This table shows us whether there are statistical differences in mean political interest between
gender for each educational level. We can see that there are no statistically significant mean
differences between male and females' interest in politics when individuals are educated to school
(P = .465) or college level (P = .793). However, when individuals are educated to University
level, there are significant differences between males and females' interest in politics (P = .002).

Reporting the results of a two-way ANOVA

You should emphasize the results from the interaction first, before you mention the main effects.
In addition, you should report whether your dependent variable was normally distributed for
each group and how you measured it (we will provide an example below).

A two-way ANOVA was conducted that examined the effect of gender and education level on
interest in politics. Our dependent variable, interest in politics, was normally distributed for the
groups formed by the combination of the levels of education level and gender as assessed by the
Shapiro-Wilk test. There was homogeneity of variance between groups as assessed by Levene's
test for equality of error variances. There was a significant interaction between the effects of
gender and education level on interest in politics, F (2, 54) = 4.643, P = .014. Simple main effects
analysis showed that males were significantly more interested in politics than females when
educated to University level (P = .002) but there were no differences between gender when
educated to school (P = .543) or college level (P = .793).

Correlation: No Causation Implied
If the values of two variables appear to be related to one another, but one is not dependent on
the other, they are considered to be correlated. When computing a correlation we need to
consider the following:

• Is there a linear relationship?
• What is the magnitude?
• What direction does it take?
• Is causation implied?

For example, fish weight and egg production are generally correlated:

• The bigger the fish, the more eggs produced


• The magnitude can be computed when you collect data from a sample of fish
• The direction is definitely positive (the bigger the fish… the more eggs produced)
• But neither variable is dependent on the other. No causation is implied, meaning we have
no reason to suspect that fish weight causes egg number or vice versa.

The correlation coefficient, r, provides a quantitative measurement of how closely two variables
are linearly related. It ranges from -1 to +1.

• A positive correlation means an increase in X is associated with an increase in Y
• A correlation of 0 means an increase in X is not associated with any linear change in Y
• A negative correlation means an increase in X is associated with a decrease in Y

Caution

• r measures the degree of linear association
• There might be a perfect NON-LINEAR association even when r is close to 0
• When the p-value > 0.05, the correlation is NOT statistically significant at the 5% level
• R² is the coefficient of determination: the proportion of variability in Y accounted for (or
explained by) the variability in X.
• We use Pearson's correlation coefficient when measuring parametric data (interval
and ratio data)
• We use the Spearman rank order correlation coefficient when measuring non-parametric data
that can be ranked (ordinal data; nominal categories have no order to rank). It is also used
when one of the variables is non-parametric while the other is parametric.
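
The cautions above can be checked numerically. The following is a minimal sketch in Python (used here only for illustration; it is not SPSS syntax) that computes Pearson's r from its definition. The data are made up for the demonstration and do not come from any data file in this manual. It shows that a perfectly linear relationship gives r = 1, while a perfect but curved relationship can give an r of 0.

```python
import math

def pearson_r(x, y):
    """Pearson's r: co-variation of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: one perfect straight line and one perfect curve.
x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]   # Y is a straight-line function of X
y_curved = [v ** 2 for v in x]      # Y is fully determined by X, but not linearly

r_linear = pearson_r(x, y_linear)
r_curved = pearson_r(x, y_curved)

# R squared is simply r multiplied by itself.
print(f"linear: r = {r_linear:.3f}, R squared = {r_linear ** 2:.3f}")
print(f"curved: r = {r_curved:.3f}, R squared = {r_curved ** 2:.3f}")
```

This is why it is worth looking at a scatter plot before relying on r: a value near 0 rules out a linear association, not every association.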

Regression: Causation Implied

Correlation and regression both test whether two variables are related to each other and, if so,
how closely they are related. Regression is used when you suspect that the two variables are
causally related, such that variation in one is causing the variation in the other. Regression is
also used when you want to know the degree to which a change in one variable can predict a
change in the other.

If you suspect that a change in one variable is causing a predictable change in another variable,
and the relationship between the two is linear, you can predict or forecast the magnitude of the
other variable using the equation of a straight line: Y = mX + c, where m is the slope,
(change in Y)/(change in X), and c is the intercept (the point
where the line crosses the Y-axis).
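
As an illustration of the straight-line equation, the sketch below estimates m and c by ordinary least squares, which is the standard way a regression line is fitted. It is written in Python for illustration only, and the five observations are hypothetical values chosen to lie exactly on Y = 2X + 1.

```python
def fit_line(x, y):
    """Least-squares estimates of the slope m and intercept c in Y = mX + c."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    m = sxy / sxx               # slope: change in Y per unit change in X
    c = mean_y - m * mean_x     # intercept: value of Y where X = 0
    return m, c

# Hypothetical observations (not from the manual's data files).
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

m, c = fit_line(x, y)
predicted = m * 6 + c           # forecast Y for a new X value of 6
print(f"Y = {m:.2f}X + {c:.2f}; predicted Y at X = 6 is {predicted:.2f}")
```

With real data the points will not sit exactly on the line, and the slope and intercept are then the values that minimise the squared vertical distances from the points to the line.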

Factor Analysis

Factor analysis is a method of data reduction. It does this by seeking underlying unobservable
(latent) variables that are reflected in the observed variables (manifest variables). There are many
different methods that can be used to conduct a factor analysis (such as principal axis factor,
maximum likelihood, generalized least squares, unweighted least squares).

There are also many different types of rotations that can be done after the initial extraction of
factors, including orthogonal rotations, such as varimax and equimax, which impose the
restriction that the factors cannot be correlated, and oblique rotations, such as promax, which
allow the factors to be correlated with one another. You also need to determine the number of
factors that you want to extract. Given the number of factor analytic techniques and options, it
is not surprising that different analysts could reach very different results analyzing the same data
set. However, all analysts are looking for simple structure. Simple structure is a pattern of results
such that each variable loads highly onto one and only one factor.

Factor analysis is a technique that requires a large sample size. Factor analysis is based on the
correlation matrix of the variables involved, and correlations usually need a large sample size
before they stabilize. Tabachnick and Fidell (2001) cite Comrey and Lee's (1992) advice regarding
sample size: 50 cases is very poor, 100 is poor, 200 is fair, 300 is good, 500 is very good, and 1000
or more is excellent. As a rule of thumb, a bare minimum of 10 observations per variable is
necessary to avoid computational difficulties.
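
The two sample size guidelines above can be turned into a quick check before running a factor analysis. This Python sketch is illustrative only. The thresholds are the Comrey and Lee values quoted above; how to label sample sizes that fall between two thresholds is an assumption made here (a sample takes the label of the highest threshold it reaches, and anything under 50 is treated as "very poor").

```python
# Comrey and Lee's guide values as quoted in the text: (minimum cases, rating).
RATINGS = [
    (1000, "excellent"),
    (500, "very good"),
    (300, "good"),
    (200, "fair"),
    (100, "poor"),
    (50, "very poor"),
]

def rate_sample_size(n_cases):
    """Label a sample size using the highest threshold it reaches (an assumption)."""
    for threshold, label in RATINGS:
        if n_cases >= threshold:
            return label
    return "very poor"  # assumption: fewer than 50 cases is no better than 50

def meets_ten_per_variable(n_cases, n_variables):
    """Rule of thumb from the text: at least 10 observations per variable."""
    return n_cases >= 10 * n_variables

# Hypothetical study: 120 respondents answering 9 questions.
print(rate_sample_size(120), meets_ten_per_variable(120, 9))
```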

For the example below, we are going to do a rather "plain vanilla" factor analysis. We will use
iterated principal axis factor with three factors as our method of extraction, a varimax rotation,
and for comparison, we will also show the promax oblique solution. The determination of the
number of factors to extract should be guided by theory, but also informed by running the
analysis extracting different numbers of factors and seeing which number of factors yields the
most interpretable results.

In this example we have included many options, including the original and reproduced
correlation matrix, the scree plot and the plot of the rotated factors. While you may not wish to
use all of these options, we have included them here to aid in the explanation of the analysis. We
have also created a page of annotated output for a principal components analysis that parallels
this analysis.

Procedures

• Select Analyze→Data Reduction→Factor


• Place the variables you want to analyze in the Variables area
Generating Descriptives
• Click on Descriptives and select a number of options
• The Coefficients option produces the R-matrix, and the Significance levels option will
produce a matrix indicating the significance levels of each correlation in the R-matrix.
• You can also ask for the determinant of the R-matrix and this option is vital for testing
for multi-collinearity or singularity.
• The determinant of R-matrix should be greater than 0.00001; if it is less than this value
then look through the correlation matrix for variables that correlate very highly (R>0.8)
and consider removing one of the variables (or more depending on the extent of the
problem) before proceeding. The choice of which of the two variables to remove will be
fairly arbitrary and finding multi-collinearity in the data should raise questions about the
choice of items within the questionnaire.
• KMO and Bartlett’s test of sphericity produces the Kaiser-Meyer-Olkin measure of
sampling adequacy and Bartlett’s test. The value of KMO should be greater than 0.5 if
the sample is adequate.
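
The determinant check described above can be sketched outside SPSS as follows. This is an illustrative Python example, not output from SPSS: the 3 x 3 correlation matrix and the variable names q1 to q3 are hypothetical. It computes the determinant of the R-matrix, compares it with the 0.00001 guideline, and lists any pairs of variables that correlate above 0.8.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]   # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0               # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det               # a row swap flips the sign
        det *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return det

def high_correlations(matrix, names, cutoff=0.8):
    """List variable pairs whose absolute correlation exceeds the cutoff."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(matrix[i][j]) > cutoff:
                pairs.append((names[i], names[j], matrix[i][j]))
    return pairs

# Hypothetical correlation matrix for three questionnaire items.
names = ["q1", "q2", "q3"]
r_matrix = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]

det = determinant(r_matrix)
flagged = high_correlations(r_matrix, names)
print(f"determinant = {det:.5f}, above 0.00001: {det > 0.00001}")
print("pairs with |r| > 0.8:", flagged)
```

In this made-up example the determinant is comfortably above the guideline even though one pair correlates at 0.9, which is why the text suggests inspecting the correlation matrix as well as the determinant.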
Factor Extraction on SPSS
• Click on Extraction to access the extraction dialog box
• There are many methods of extraction but we shall use the Principal Component
Analysis method
• The Display box has two options: to display the unrotated factor solution and the scree
plot.
The unrotated factor solution is useful in assessing the improvement in interpretation
due to rotation.
The scree plot is useful in establishing how many factors should be retained in an analysis.
• If the rotated solution is little better than the unrotated solution then it is possible that
an inappropriate rotation method has been used.
• The Extract box provides options pertaining to the retention of factors. You have the
choice of selecting factors with eigenvalues greater than a user-specified value or retaining
a fixed number of factors. Kaiser’s recommendation is to retain factors with eigenvalues over 1.
• It is advisable to run a primary analysis with the eigenvalues over 1 option selected, select a
scree plot and compare the results. If looking at the scree plot and eigenvalues over 1 lead
you to retain the same number of factors then continue. If the two criteria give different
results then examine the communalities and decide for yourself which of the two criteria to
believe.
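
Kaiser's criterion is easy to apply once the eigenvalues have been read off the SPSS output. The Python sketch below is illustrative only and uses hypothetical eigenvalues for six variables. It relies on the fact that the eigenvalues of a correlation matrix add up to the number of variables, so each eigenvalue divided by that total is the share of variance the factor accounts for.

```python
def kaiser_retained(eigenvalues, cutoff=1.0):
    """Return the eigenvalues that pass Kaiser's criterion (greater than the cutoff)."""
    return [value for value in eigenvalues if value > cutoff]

def proportion_of_variance(eigenvalues):
    """Share of total variance attributed to each factor."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]

# Hypothetical eigenvalues for a six-variable analysis (they sum to 6).
eigenvalues = [2.8, 1.4, 0.9, 0.5, 0.3, 0.1]

retained = kaiser_retained(eigenvalues)
shares = proportion_of_variance(eigenvalues)
cumulative = sum(shares[:len(retained)])

print(f"factors retained: {len(retained)}")
print(f"variance explained by retained factors: {cumulative:.1%}")
```

The third hypothetical eigenvalue (0.9) falls just under the cutoff. A case like this is where the scree plot and the communalities are worth consulting before settling on the number of factors.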
Rotation
The interpretability of factors can be improved through rotation. Rotation maximizes the
loading of each variable on one of the extracted factors whilst minimizing the loading on
all the other factors.
• Click Rotation to display the dialogue box.



Pearson's Product-Moment Correlation using SPSS

The Pearson product-moment correlation coefficient is a measure of the strength and direction
of association that exists between two variables measured on at least an interval scale. It is
denoted by the symbol r.

Assumptions

• Variables are measured at the interval or ratio level (continuous).


• Variables are approximately normally distributed.
• There is a linear relationship between the two variables.
• Pearson's r is sensitive to outliers, so it is best if outliers are kept to a minimum or
there are no outliers at all.

Example

A researcher wishes to know whether a person's monthly Income is related to his/her monthly
savings. We shall use the Savings data file (SPSS Exercise). The researcher then investigates
whether there is an association between Income and Savings.

Testing assumptions

Your variables need to be normally distributed. Pearson's r is also very susceptible to outliers in
the data so you need to test for outliers. What if your samples are not normally distributed or
there are outliers? If your samples violate the assumption of normality or have outliers then you
might need to consider using a non-parametric test such as Spearman's Correlation.

Test Procedure in SPSS

1. Click Analyze > Correlate > Bivariate... on the menu system.
2. Transfer the variables "Income" and "Savings" into the "Variables:" box by
dragging-and-dropping or by clicking the arrow button.
3. Make sure that the Pearson tickbox is checked under the "Correlation Coefficients" group
(although it is selected by default in SPSS).
4. Click the Options button. If you wish to generate some descriptives you can do it here by
clicking on the particular tickbox. Then click the Continue button.
5. Click the OK button.

Output

You will be presented with the Correlations table in the output viewer as below:
Correlations

                                 INCOME     SAVING
INCOME    Pearson Correlation    1          .767**
          Sig. (2-tailed)                   .000
          N                      120        120
SAVING    Pearson Correlation    .767**     1
          Sig. (2-tailed)        .000
          N                      120        120

**. Correlation is significant at the 0.01 level (2-tailed).

The results are presented in a matrix such that, as can be seen above, the correlations are
replicated. Nevertheless, the table presents the Pearson correlation coefficient, the significance
value and the sample size that the calculation is based on. In this example, we can see that the
Pearson correlation coefficient, r, is 0.767 and that this is statistically significant (p < 0.05).

Understanding the Output

In our example you might present the results as follows:

A Pearson product-moment correlation was run to determine the relationship between an
individual's monthly Income and their monthly Savings. The data showed no violation of
normality, linearity or homoscedasticity. There was a strong, positive correlation between
Income and Savings, which was statistically significant (r = .767, n = 120, p < 0.05).

Spearman's Rank Order Correlation using SPSS

The Spearman Rank Order Correlation coefficient is a non-parametric measure of the strength
and direction of association that exists between two variables measured on at least an ordinal
scale. It is denoted by the symbol rs (or the Greek letter ρ, pronounced rho). The test is used for
either ordinal variables or for interval data that has failed the assumptions necessary for
conducting the Pearson's product-moment correlation.

Assumptions

• Variables are measured on an ordinal, interval or ratio scale.


• Variables need NOT be normally distributed.
• There is a monotonic relationship between the two variables, i.e. either the variables
increase in value together, or as one variable value increases the other variable value decreases.
• This type of correlation is NOT very sensitive to outliers.
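
To make the assumptions above concrete, the sketch below computes Spearman's coefficient as Pearson's r applied to ranks, with tied values given the average of their ranks. This is the usual definition, but the code is a Python illustration rather than the SPSS procedure, and the data are hypothetical. It uses a relationship that is perfectly monotonic but not linear.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1   # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson's r computed from its definition."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rs(x, y):
    """Spearman's rank order correlation: Pearson's r on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: Y always rises with X, but along a curve.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

rs = spearman_rs(x, y)
r = pearson_r(x, y)
print(f"Spearman rs = {rs:.3f}, Pearson r = {r:.3f}")
```

Because only the ordering of the values matters, an extreme score changes its rank by very little, which is why Spearman's coefficient is less sensitive to outliers than Pearson's r.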

Example

A teacher is interested in whether those who do best at Science also do better at Maths (assessed
by exam). She records the scores of her students as they performed in end-of-year examinations
for both Science and Maths.

Test Procedure in SPSS

1. Click Analyze > Correlate > Bivariate... on the menu system.
2. Transfer the variables "Math Score" and "Science Score" into the "Variables" box by
dragging-and-dropping or by clicking the arrow button.
3. Make sure that you uncheck the Pearson tickbox (it is selected by default in SPSS) and
check the Spearman tickbox under the "Correlation Coefficients" group.
4. Click the OK button.

Output

You will be presented with the following output table under the title "Correlations":
Correlations

                                                            math score    science score
Spearman's rho    math score       Correlation Coefficient  1.000         .640**
                                   Sig. (2-tailed)          .             .000
                                   N                        200           200
                  science score    Correlation Coefficient  .640**        1.000
                                   Sig. (2-tailed)          .000          .
                                   N                        200           200

**. Correlation is significant at the 0.01 level (2-tailed).

The results are presented in a matrix such that, as can be seen, the correlations are replicated.
Nevertheless, the table presents Spearman's Rank Order Correlation, its significance value and
the sample size that the calculation was based on. In this example, we can see that Spearman's
correlation coefficient, rs, is 0.640 and that this is statistically significant (P < 0.05).

Reporting the Output

In our example you might present the results as follows: A Spearman's Rank Order correlation
was run to determine the relationship between 200 students' science and maths exam marks.
There was a strong, positive correlation between Science scores and Maths scores, which was
statistically significant (rs = .640, P < .001).

Linear Regression Analysis using SPSS

Regression analysis is the next step up after correlation; it is used when we want to predict the
value of a variable based on the value of another variable. In this case, the variable we are using
to predict the other variable's value is called the independent variable or sometimes the predictor
variable. The variable we are wishing to predict is called the dependent variable or sometimes
the outcome variable.
Example

A researcher may be interested in determining whether there is a relationship between an
individual's Income and their Savings, as in the correlation example.

Assumptions

• Variables are measured at the interval or ratio level (continuous)


• Variables are approximately normally distributed
• There is a linear relationship between the two variables.

Procedure

1. Click Analyze > Regression > Linear... on the top menu.
2. Transfer the independent (predictor) variable, Income, into the "Independent(s):" box and
the dependent (outcome) variable, Savings, into the "Dependent:" box.
3. Click the OK button.

Output of Linear Regression Analysis

SPSS will generate quite a number of tables in its results section for a linear regression. In this
session, we are going to look at the important tables. The first table of interest is the Model
Summary table. This table provides the R and R² values. The R value is 0.767, which represents
the simple correlation and, therefore, indicates a high degree of correlation. The R² value
indicates how much of the variation in the dependent variable, Savings, can be explained by the
independent variable, Income. In this case, 58.8% can be explained, which is quite large.

Model Summary

Model    R        R Square    Adjusted R Square    Std. Error of the Estimate
1        .767a    .588        .585                 1584.296

a. Predictors: (Constant), INCOME

The next table is the ANOVA table. This table indicates whether the regression model predicts the
outcome variable significantly well. How do we know this? Look at the "Regression" row and go
to the Sig. column. This indicates the statistical significance of the regression model that was
applied. Here, Sig. is shown as .000 (that is, P < 0.001), which is less than 0.05 and indicates
that, overall, the model predicts the outcome variable significantly well.

ANOVAb

Model            Sum of Squares    df     Mean Square    F          Sig.
1  Regression    4.228E8           1      4.228E8        168.454    .000a
   Residual      2.962E8           118    2509992.867
   Total         7.190E8           119

a. Predictors: (Constant), INCOME
b. Dependent Variable: SAVING

The table below, Coefficients, provides us with information on each predictor variable. This
provides us with the information necessary to predict savings from income. We can see that both
the constant and income contribute significantly to the model (by looking at the Sig. column).
By looking at the B column under the Unstandardized Coefficients column we can present the
regression equation as:

Savings = 0.006(Income) – 2,059

Coefficientsa

                 Unstandardized Coefficients    Standardized Coefficients
Model            B            Std. Error        Beta                         t         Sig.
1  (Constant)    -2059.174    352.820                                        -5.836    .000
   INCOME        .006         .000              .767                         12.979    .000

a. Dependent Variable: SAVING
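
The Coefficients table can be used directly for prediction. The following Python sketch is for illustration only: it plugs the B values from the table into the regression equation and reproduces the t value for the constant as B divided by its standard error. The income figure of 1,000,000 is an arbitrary example value, and because SPSS displays the Income coefficient rounded to .006, predictions made from the printed table are approximate.

```python
# Values copied from the Coefficients table above.
B_CONSTANT = -2059.174
SE_CONSTANT = 352.820
B_INCOME = 0.006    # displayed rounded to three decimals in the SPSS output

def predict_savings(income):
    """Predicted Savings from the fitted equation: constant + slope * Income."""
    return B_CONSTANT + B_INCOME * income

# t statistic for a coefficient = coefficient / its standard error.
t_constant = B_CONSTANT / SE_CONSTANT

example_income = 1_000_000          # arbitrary illustrative value
predicted = predict_savings(example_income)

print(f"t for the constant = {t_constant:.3f}")
print(f"predicted savings at income {example_income} = {predicted:.3f}")
```

The same division cannot be reproduced for Income from the printed table, because its standard error is displayed as .000 after rounding.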

Chi-Square Test for Association using SPSS

Objective
The Chi-Square test for independence, also called Pearson's Chi-square test or the Chi-square
test of association is used to discover if there is a relationship between two categorical variables.

Example

Educators are always looking for novel ways in which to teach statistics to undergraduates as
part of a non-statistics degree course, e.g. psychology. With current technology it is possible to
present how-to guides for statistical programs online instead of in a book. However, different
people learn in different ways. An educator would like to know whether gender (male/female) is
associated with the preferred type of learning medium (online vs. books). We therefore have two
nominal variables: Gender(male/female) and Preferred Learning Medium (online/books).

Assumptions

• Two variables that are ordinal or nominal (categorical data).


• There are two or more groups in each variable.

Test Procedure

1. Click Analyze → Descriptive Statistics → Crosstabs...
2. You will be presented with the Crosstabs dialogue box.
3. Transfer one of the variables into the "Row(s):" box and the other variable into the
"Column(s):" box. In our example we will transfer the "Gender" variable into the
"Row(s):" box and "Preferred_Learning" into the "Column(s):" box.
4. If you want to display clustered bar charts (recommended) then make sure that the "Display
clustered bar charts" checkbox is ticked.
5. Click on the Statistics button. Select the "Chi-square" and "Phi and Cramer's V" options.
6. Click the Continue button.
7. Click the Cells button. Select "Observed" from the "Counts" area and "Row",
"Column" and "Total" from the "Percentages" area. Click the Continue button.
8. Click the Format button. [This next option is only really useful if you have more than
two categories in one of your variables, but we show it here in case you do.]
This option allows you to change the order of the values to either ascending or
descending. Once you have made your choice click the Continue button.
9. Click the OK button to generate your output.

Output

You will be presented with some tables in the Output Viewer under the title "Crosstabs". The
tables of note are presented below:

The Crosstabulation Table (Gender*Preferred Learning Medium Crosstabulation)


This table allows us to understand that both males and females prefer to learn using online
materials vs. books.

The Chi-Square Tests Table

When reading this table we are interested in the results for the Continuity correction. We can
see here that Chi-square(1) = 0.487, P = 0.485. This tells us that there is no statistically
significant association between Gender and Preferred Learning Medium. That is, both Males and
Females equally prefer online learning vs. books. If you had a 2 x 2 contingency table and small
numbers then ......

The Symmetric Measures Table

Phi and Cramer's V are both tests of the strength of association. We can see that the strength of
association between the variables is very weak.
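
The statistics in these tables can be sketched for a 2 x 2 table as follows. This is an illustrative Python example, not SPSS output. The cell counts are hypothetical because the crosstabulation counts are not reproduced in the text. The continuity correction is Yates' correction, and the p-value formula used is specific to 1 degree of freedom. The final lines check that a chi-square of 0.487 with 1 degree of freedom does correspond to the P = 0.485 quoted above.

```python
import math

def chi_square_2x2(table, yates=True):
    """Chi-square statistic for a 2 x 2 table, optionally with Yates' correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)   # continuity correction
            chi2 += diff ** 2 / expected
    return chi2

def p_value_df1(chi2):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

def phi_magnitude(table):
    """Size of phi for a 2 x 2 table: sqrt(uncorrected chi-square / N)."""
    n = sum(sum(row) for row in table)
    return math.sqrt(chi_square_2x2(table, yates=False) / n)

# Hypothetical counts: rows = male, female; columns = online, books.
counts = [[20, 10],
          [18, 12]]

chi2_corrected = chi_square_2x2(counts)
chi2_uncorrected = chi_square_2x2(counts, yates=False)
phi = phi_magnitude(counts)
print(f"continuity-corrected chi-square = {chi2_corrected:.3f}, "
      f"P = {p_value_df1(chi2_corrected):.3f}, phi = {phi:.3f}")

# Check of the value reported in the text: chi-square(1) = 0.487.
p_reported = p_value_df1(0.487)
print(f"P for chi-square(1) = 0.487 is {p_reported:.3f}")
```

Note that SPSS reports phi with a sign, while this sketch returns only its size.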

Bar Chart
It can be easier to visualize data than read tables. The clustered bar chart option allows a relevant
graph to be produced that highlights the group categories and the frequency of counts in these
groups.
Multiple Linear Regression Analysis Example (Using the Science Data File)

Variables in the model

c. Model - SPSS allows you to specify multiple models in a single regression command. This
tells you the number of the model being reported.

d. Variables Entered - SPSS allows you to enter variables into a regression in blocks, and it
allows stepwise regression. Hence, you need to know which variables were entered into the
current regression. If you did not block your independent variables or use stepwise regression,
this column should list all of the independent variables that you specified.

e. Variables Removed - This column lists the variables that were removed from the current
regression. Usually, this column will be empty unless you did a stepwise regression.

f. Method - This column tells you the method that SPSS used to run the regression. "Enter"
means that each independent variable was entered in usual fashion. If you did a stepwise
regression, the entry in this column would tell you that.

Overall Model Fit


b. Model - SPSS allows you to specify multiple models in a single regression command. This
tells you the number of the model being reported.

c. R - R is the square root of R-Squared and is the correlation between the observed and predicted
values of dependent variable.

d. R-Square - This is the proportion of variance in the dependent variable (science) which can
be explained by the independent variables (math, female, socst and read). This is an overall
measure of the strength of association and does not reflect the extent to which any particular
independent variable is associated with the dependent variable.

e. Adjusted R-square - This is an adjustment of the R-squared that penalizes the addition of
extraneous predictors to the model. Adjusted R-squared is computed using the formula
1 - ((1 - Rsq)(N - 1) / (N - k - 1)), where N is the number of cases and k is the number of predictors.

f. Std. Error of the Estimate - This is also referred to as the root mean squared error. It is the
standard deviation of the error term and the square root of the Mean Square for the Residuals in
the ANOVA table (see below).
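
The definitions above can be checked against the simple regression of Savings on Income reported earlier in this manual (R = .767, N = 120, one predictor, Mean Square for the Residual = 2509992.867). The Python sketch below is only an illustration of the arithmetic. It reproduces the R Square, Adjusted R Square and Std. Error of the Estimate shown in that Model Summary table, to within rounding.

```python
import math

def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-squared: 1 - (1 - Rsq)(N - 1) / (N - k - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Values taken from the Savings and Income example earlier in the manual.
r = 0.767                       # R from the Model Summary table
n = 120                         # number of cases
k = 1                           # one predictor (Income)
ms_residual = 2509992.867       # Mean Square, Residual row of the ANOVA table

r_squared = r ** 2
adj_r_squared = adjusted_r_squared(r_squared, n, k)
std_error_estimate = math.sqrt(ms_residual)

print(f"R Square = {r_squared:.3f}")
print(f"Adjusted R Square = {adj_r_squared:.3f}")
print(f"Std. Error of the Estimate = {std_error_estimate:.3f}")
```

With only one predictor and 120 cases the adjustment is small. The penalty grows as more predictors are added relative to the number of cases.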

Anova Table
c. Model - SPSS allows you to specify multiple models in a single regression command. This
tells you the number of the model being reported.

d. Regression, Residual, Total - Looking at the breakdown of variance in the outcome variable,
these are the categories we will examine: Regression, Residual, and Total. The Total variance is
partitioned into the variance which can be explained by the independent variables (Regression)
and the variance which is not explained by the independent variables (Residual).

e. Sum of Squares - These are the Sum of Squares associated with the three sources of variance:
Total, Regression and Residual.

f. df - These are the degrees of freedom associated with the sources of variance. The total variance
has N-1 degrees of freedom. The Regression degrees of freedom corresponds to the number of
coefficients estimated minus 1. Including the intercept, there are 5 coefficients, so the model has
5-1=4 degrees of freedom. The Residual degrees of freedom is the DF total minus the DF
regression, 199 - 4 = 195.

g. Mean Square - These are the Mean Squares, the Sum of Squares divided by their respective
DF.

h. F and Sig. - This is the F-statistic and the p-value associated with it. The F-statistic is the
Mean Square (Regression) divided by the Mean Square (Residual): 2385.93/51.096 = 46.695. The
p-value is compared to some alpha level in testing the null hypothesis that all of the model
coefficients are 0.
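
The arithmetic in this ANOVA table can be retraced in a few lines. The Python sketch below is illustrative only. It uses the degrees of freedom and mean squares quoted above, with N = 200 inferred from the total df of 199. The sums of squares and the R-squared in the last lines are derived from the rounded mean squares, so they are approximations rather than values copied from the SPSS output.

```python
# Quantities quoted in the text.
n_cases = 200            # implied by the total df of 199
n_coefficients = 5       # intercept plus math, female, socst and read
ms_regression = 2385.93
ms_residual = 51.096

# Degrees of freedom.
df_total = n_cases - 1
df_regression = n_coefficients - 1
df_residual = df_total - df_regression

# F is the ratio of the two mean squares.
f_statistic = ms_regression / ms_residual

# Derived values (approximate): a sum of squares is the mean square times its df.
ss_regression = ms_regression * df_regression
ss_residual = ms_residual * df_residual
r_squared = ss_regression / (ss_regression + ss_residual)

print(f"df: regression = {df_regression}, residual = {df_residual}, total = {df_total}")
print(f"F = {f_statistic:.3f}")
print(f"derived R squared = {r_squared:.3f}")
```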

Parameter Estimates
b. Model - SPSS allows you to specify multiple models in a single regression command. This
tells you the number of the model being reported.

c. This column shows the predictor variables (constant, math, female, socst, read). The first
variable (constant) represents the constant, also referred to in textbooks as the Y intercept, the
height of the regression line when it crosses the Y axis. In other words, this is the predicted value
of science when all other variables are 0.

d. B - These are the values for the regression equation for predicting the dependent variable from
the independent variable. The regression equation is presented in many different ways, for
example:

Ypredicted = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4

The column of estimates provides the values for b0, b1, b2, b3 and b4 for this equation.

math - The coefficient for math is .389. So for every unit increase in math, a 0.39 unit increase
in science is predicted, holding all other variables constant.
female - For every unit increase in female, we expect a 2.010 unit decrease in the science score,
holding all other variables constant. Because female is coded 0/1 (0=male, 1=female), the
interpretation is easy: for females, the predicted science score would be 2 points lower than for
males.
socst - The coefficient for socst is .050. So for every unit increase in socst, we expect an
approximately .05 point increase in the science score, holding all other variables constant.
read - The coefficient for read is .335. So for every unit increase in read, we expect a .34 point
increase in the science score, holding all other variables constant.
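
The "holding all other variables constant" interpretation can be illustrated with a short Python sketch. The value of the constant (b0) is not quoted in the text, so the sketch computes only the predicted change in the science score when one or more predictors change, which does not depend on b0. The function name and the example changes are hypothetical.

```python
# Unstandardized coefficients (B) quoted in the text.
COEFFICIENTS = {"math": 0.389, "female": -2.010, "socst": 0.050, "read": 0.335}

def predicted_change(changes):
    """Predicted change in science when the named predictors change by the
    given amounts and every other predictor is held constant."""
    return sum(COEFFICIENTS[name] * delta for name, delta in changes.items())

# A female student compared with a male student with identical other scores.
gender_gap = predicted_change({"female": 1})

# A student scoring 10 points higher on both math and read.
higher_scores = predicted_change({"math": 10, "read": 10})

print(f"female vs male: {gender_gap:+.2f} points")
print(f"+10 math and +10 read: {higher_scores:+.2f} points")
```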

e. Std. Error - These are the standard errors associated with the coefficients.

f. Beta - These are the standardized coefficients. These are the coefficients that you would obtain
if you standardized all of the variables in the regression, including the dependent and all of the
independent variables, and ran the regression. By standardizing the variables before running the
regression, you have put all of the variables on the same scale, and you can compare the
magnitude of the coefficients to see which one has more of an effect. You will also notice that the
larger betas are associated with the larger t-values and lower p-values.

g. t and Sig. - These are the t-statistics and their associated 2-tailed p-values used in testing
whether a given coefficient is significantly different from zero. Using an alpha of 0.05:
The coefficient for math (0.389) is significantly different from 0 because its p-value is 0.000,
which is smaller than 0.05.
The coefficient for female (-2.010) is not significantly different from 0 because its p-value is
0.051, which is larger than 0.05.
The coefficient for socst (0.0498443) is not statistically significantly different from 0 because its
p-value is definitely larger than 0.05.
The coefficient for read (0.3352998) is statistically significant because its p-value of 0.000 is less
than .05.
The intercept is significantly different from 0 at the 0.05 alpha level.

h. 95% Confidence Limit for B Lower Bound and Upper Bound - These are the 95%
confidence intervals for the coefficients. The confidence intervals are related to the p-values such
that the coefficient will not be statistically significant if the confidence interval includes 0. These
confidence intervals can help you to put the estimate from the coefficient into perspective by
seeing how much the value could vary.
Cronbach's Alpha (α) using SPSS

Cronbach's alpha is the most common measure of internal consistency ("reliability"). It is most
commonly used when you have multiple Likert questions in a survey/questionnaire that form a
scale and you wish to determine if the scale is reliable.
Example

A researcher has devised a nine-question questionnaire with which they hope to measure how
safe people feel at work at an industrial complex. Each question was a 5-point Likert item from
"strongly disagree" to "strongly agree". In order to understand whether the questions in this
questionnaire all reliably measure the same latent variable (feeling of safety) (so a Likert scale
could be constructed), a Cronbach's alpha was run on a sample size of 15 workers.
Setup in SPSS

The nine questions have been labelled "Qu1" through to "Qu9".


Test Procedure in SPSS

1. Click Analyze → Scale → Reliability Analysis...
2. You will be presented with the Reliability Analysis dialogue box.
3. Transfer the variables "Qu1" to "Qu9" into the "Items:" box.
4. Leave the "Model:" set as "Alpha", which represents Cronbach's alpha in SPSS. If you want
to provide a name for the scale enter it in the "Scale label:" box. Since this only prints the
name you enter at the top of the SPSS output, it is not essential that you do, and
in this case we will leave it blank.
5. Click on the Statistics button, which will present the Reliability Analysis: Statistics
dialogue box.
6. Select "Item", "Scale" and "Scale if item deleted" in the "Descriptives for" box and
"Correlations" in the "Inter-Item" box.
7. Click the Continue button. This will return you to the Reliability Analysis dialogue box.
8. Click the OK button to generate the output.


SPSS Output for Cronbach's Alpha

SPSS produces many different tables. The first important table is the Reliability Statistics table
that provides the actual value for Cronbach's alpha, as shown below:

We can see that in our example, Cronbach's alpha is 0.805, which indicates a high level of internal
consistency for our scale with this specific sample.
Item-Total Statistics

The Item-Total Statistics table presents the Cronbach's Alpha if Item Deleted in the final
column, as shown below:

This column presents the value that Cronbach's alpha would be if that particular item was deleted
from the scale. We can see that removal of any question except question 8, would result in a
lower Cronbach's alpha. Therefore, we would not want to remove these questions. Removal of
question 8 would lead to a small improvement in Cronbach's alpha and we can also see that the
Corrected Item-Total Correlation value was low (0.128) for this item. This might lead us to
consider whether we should remove this item.
Cronbach's alpha simply provides you with an overall reliability coefficient for a set of variables,
e.g. questions. If your questions reflect different underlying personal qualities (or other
dimensions), for example, employee motivation and employee commitment, then Cronbach's
alpha will not be able to distinguish between these. In order to do this and then check their
reliability (using Cronbach's alpha), you will first need to run a test such as a principal
components analysis (PCA).
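
For readers who want to see what the Reliability Statistics and Item-Total Statistics tables are computing, here is a minimal Python sketch of Cronbach's alpha and of "alpha if item deleted". It uses the standard formula, alpha = k/(k - 1) x (1 - sum of item variances / variance of total scores). The three items and five respondents are hypothetical and are not the nine-question safety data described above.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha. `items` holds one list of scores per question,
    with respondents in the same order in every list."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]   # total score per respondent
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

def alpha_if_item_deleted(items):
    """Alpha recomputed with each item left out in turn."""
    return [cronbach_alpha(items[:i] + items[i + 1:]) for i in range(len(items))]

# Hypothetical 5-point Likert responses from five people to three questions.
q1 = [1, 2, 3, 4, 5]
q2 = [2, 2, 3, 4, 5]
q3 = [1, 3, 3, 5, 4]
items = [q1, q2, q3]

alpha = cronbach_alpha(items)
alpha_without = alpha_if_item_deleted(items)

print(f"Cronbach's alpha = {alpha:.3f}")
for name, value in zip(["q1", "q2", "q3"], alpha_without):
    print(f"alpha if {name} deleted = {value:.3f}")
```

An item whose removal raises alpha, as question 8 does in the example above, is one that does not move together with the rest of the scale.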
Hierarchical Multiple Regression in SPSS
Hierarchical Linear Modeling (HLM) is a complex form of ordinary least squares (OLS)
regression that is used to analyze variance in the outcome variables when the predictor variables
are at varying hierarchical levels. Hierarchical multiple regression, by contrast, is a variant of
the basic multiple regression procedure that allows you to specify a fixed order of entry for
variables in order to control for the effects of covariates or to test the effects of certain
predictors independent of the influence of others.
As with standard multiple regression, the basic command for this procedure is “Analyze” →
“Regression” → “Linear” from the SPSS menu.
In the main dialog box, input the dependent variable (for example Saving). We want to test
whether Age, Distance, Family Size and Income are predictors of Savings.

Enter the first set of predictor variables (Location, Sex and Education) into the Independent(s)
box. These are the variables that you want SPSS to put into the model first, generally the ones
that you want to control for when testing the other set of variables (Age, Distance, Family Size
and Income) that you are truly interested in. The control variables are put into the model first
so that they get “credit” for any variability they share with the predictors that we are really
interested in (Age, Distance, Family Size and Income). Any observed effect of Age, Distance,
Family Size and Income can then be said to be “independent of” the effects of the variables that
we have already controlled for.

The next step is to enter the variables that we are really interested in: Age, Distance, Family
Size and Income. To put them into the model, click the “Next” button. You will see all of the
predictor variables that you previously entered disappear. Don’t panic! They are still in the
model, just not on the current screen. Now move Age, Distance, Family Size and Income into the
empty Independent(s) box as Block 2.

Click on the “OK” button to run the analysis. You could also hit “Next” again, if you wanted to
enter a third (or fourth, or fifth, etc.) block of variables. Often researchers will enter variables as
related sets – for example, all demographic variables in a first step, all potentially confounding
psychological variables in a second step, and then the variable that you are most interested in as
a third step. This is not necessarily the only way to proceed – you could also enter each variable
as a separate step if that seems more logical based on the design of your experiment.
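
The block-entry logic can also be sketched outside SPSS. The following is a minimal illustration in plain Python, not SPSS's own procedure: it fits an ordinary least squares model on a first block of predictors, then on both blocks together, and compares R Square. The data are synthetic, and the variable names and generating coefficients are assumptions chosen only for illustration.

```python
import random


def r_squared(predictors, y):
    """R-squared of an OLS fit of y on the predictor rows (intercept added)."""
    X = [[1.0] + list(row) for row in predictors]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y]
    M = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(k):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    coefs = [M[i][k] / M[i][i] for i in range(k)]
    fitted = [sum(b * v for b, v in zip(coefs, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic data (hypothetical values, not the manual's data set)
random.seed(0)
n = 120
sex = [random.randint(0, 1) for _ in range(n)]
educ = [random.randint(1, 4) for _ in range(n)]
income_k = [random.gauss(300, 100) for _ in range(n)]  # income in thousands
fsize = [random.randint(1, 8) for _ in range(n)]
saving = [6 * inc - 250 * fs + random.gauss(0, 500)
          for inc, fs in zip(income_k, fsize)]

block1 = list(zip(sex, educ))                          # step 1: controls only
block1_and_2 = list(zip(sex, educ, income_k, fsize))   # step 2: controls + predictors of interest

r2_step1 = r_squared(block1, saving)
r2_step2 = r_squared(block1_and_2, saving)
r2_change = r2_step2 - r2_step1
print(f"Step 1 R2 = {r2_step1:.3f}, Step 2 R2 = {r2_step2:.3f}, change = {r2_change:.3f}")
```

Because the step 2 model contains every predictor from step 1, its R Square can never be lower than the step 1 value; the size of the change is what the hierarchical analysis is about.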
SPSS also lets you specify a “Method” for each step. For example, if you wanted to know which
demographic predictors were most effective, you could use a stepwise procedure in the first block,
and then still enter Age, Distance, Family Size and Income in the second block. To do this, you
would just choose “Stepwise” instead of “Enter” from the drop-down box labeled “Method.”
Using just the default “Enter” method, with all the variables in Block 1 (demographics) entered
together, followed by Age, Distance, Family Size and Income as predictors in Block 2, we get
the following output:
Variables Entered/Removed (a)

Model   Variables Entered                  Variables Removed   Method
1       EDUC_LV, LOCATION, SEX (b)         .                   Enter
2       AGE, FSIZE, INCOME, DISTANCE (b)   .                   Enter

a. Dependent Variable: SAVING
b. All requested variables entered.

The table above confirms which variables were entered in each step – the three demographic
variables in step 1, and Age, Distance, Family Size and Income in step 2.

Model Summary

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .289 (a)   .083       .060                2383.424
2       .854 (b)   .730       .713                1316.824

a. Predictors: (Constant), EDUC_LV, LOCATION, SEX
b. Predictors: (Constant), EDUC_LV, LOCATION, SEX, AGE, FSIZE, INCOME, DISTANCE
The Model Summary table shows the percent of variability in the dependent variable that can be
accounted for by all the predictors in a model together (that is the interpretation of R Square).
At step 1, demographics accounted for only 8.3%. The change in R Square is a way to evaluate how
much predictive power was added to the model by the addition of Age, Distance, Family Size and
Income in step 2. In this case, the percent of variability accounted for went up from 8.3% to
73.0%, an increase of 64.7 percentage points.
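
As a rough check on these figures, the change in R Square and the Adjusted R Square values can be recomputed by hand. The sketch below uses the standard adjusted R Square formula and assumes 120 cases, which is inferred from the Total df of 119 in the ANOVA table of this output. Because the inputs are the rounded R Square values, the results may differ from the SPSS output in the third decimal place.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n cases and k predictors (excluding the constant)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


n = 120                     # assumption: Total df of 119 implies 120 cases
r2_model1, k1 = 0.083, 3    # EDUC_LV, LOCATION, SEX
r2_model2, k2 = 0.730, 7    # plus AGE, FSIZE, INCOME, DISTANCE

r2_change = r2_model2 - r2_model1
print(f"R2 change: {r2_change:.3f}")
print(f"Adjusted R2, model 1: {adjusted_r2(r2_model1, n, k1):.3f}")
print(f"Adjusted R2, model 2: {adjusted_r2(r2_model2, n, k2):.3f}")
```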

ANOVA (a)

Model            Sum of Squares    df    Mean Square     F        Sig.
1   Regression     60035328.319     3    20011776.106    3.523    .017 (b)
    Residual      658962618.348   116     5680712.227
    Total         718997946.667   119
2   Regression    524787044.244     7    74969577.749   43.234    .000 (c)
    Residual      194210902.422   112     1734025.914
    Total         718997946.667   119

a. Dependent Variable: SAVING
b. Predictors: (Constant), EDUC_LV, LOCATION, SEX
c. Predictors: (Constant), EDUC_LV, LOCATION, SEX, AGE, FSIZE, INCOME, DISTANCE

The table above shows that the first model (demographic variables alone) predicts Savings to
a statistically significant degree (F(3, 116) = 3.523, p = 0.017). The second model
(demographics plus Age, Distance, Family Size and Income) also predicts Savings to a
statistically significant degree (F(7, 112) = 43.234, p < 0.001).
If the first set of predictors was significant, but the second wasn’t, it would mean that Age,
Distance, Family Size and Income did not have an effect above and beyond the effects of
demographics. In this case the second model is significant and explains far more variance, which
indicates that Age, Distance, Family Size and Income have an effect above and beyond the effects
of demographics and that one or more of these predictors is significant. Strictly, the added
contribution is tested by the F test for the change in R Square, which SPSS reports when
“R squared change” is ticked under “Statistics”.
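
The F test for the variables added in step 2 can be worked out from the sums of squares in the ANOVA table above. This is a sketch of the standard nested-model F test, written in plain Python rather than taken from SPSS; the function name is hypothetical. The resulting statistic would be compared against an F distribution with 4 and 112 degrees of freedom.

```python
def f_change(ss_reg_small, ss_reg_big, ss_res_big, df_added, df_res_big):
    """F statistic for the predictors added in the larger of two nested models."""
    ms_added = (ss_reg_big - ss_reg_small) / df_added
    ms_residual = ss_res_big / df_res_big
    return ms_added / ms_residual


# Values copied from the ANOVA table above
ss_reg_1, df_reg_1 = 60035328.319, 3
ss_reg_2, df_reg_2 = 524787044.244, 7
ss_res_2, df_res_2 = 194210902.422, 112

df_added = df_reg_2 - df_reg_1  # four predictors added in step 2
f_value = f_change(ss_reg_1, ss_reg_2, ss_res_2, df_added, df_res_2)
print(f"F change = {f_value:.2f} on {df_added} and {df_res_2} df")
```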

Coefficients (a)

                   Unstandardized Coefficients   Standardized Coefficients
Model              B           Std. Error        Beta                        t        Sig.
1   (Constant)     1855.901    704.687                                       2.634    .010
    LOCATION       -283.893    151.756           -.168                       -1.871   .064
    SEX            666.472     499.371           .132                        1.335    .185
    EDUC_LV        357.736     290.347           .122                        1.232    .220
2   (Constant)     131.747     611.969                                       .215     .830
    LOCATION       -183.474    85.297            -.108                       -2.151   .034
    SEX            54.171      290.405           .011                        .187     .852
    EDUC_LV        -170.821    461.987           -.058                       -.370    .712
    DISTANCE       -15.467     82.229            -.033                       -.188    .851
    AGE            18.701      11.355            .087                        1.647    .102
    FSIZE          -256.851    34.569            -.393                       -7.430   .000
    INCOME         .006        .001              .707                        11.189   .000

a. Dependent Variable: SAVING

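The table above shows that, in the first model, none of the demographic variables was
individually a statistically significant predictor of the DV, Savings (p > 0.05: Location = 0.064,
Sex = 0.185, Educ = 0.220). In the second model (demographics plus Age, Distance, Family Size
and Income), Location (p = 0.034), Family Size (p < 0.001) and Income (p < 0.001) are
significant predictors of Savings. The standardized coefficients mean that a one standard
deviation increase in Family Size is associated with a 0.393 standard deviation decrease in
Savings, and a one standard deviation increase in Income with a 0.707 standard deviation increase
in Savings, holding the other predictors constant. In original units (the B column), each
additional family member is associated with a decrease of 256.851 units in Savings, and each
unit increase in Income with an increase of 0.006 units in Savings.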
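
Two quantities in the Coefficients table can be recomputed by hand. The sketch below shows that each t value is the unstandardized B divided by its standard error, and that B gives the expected change in Savings for a given change in one predictor with the others held constant. The function names and the "two extra family members" scenario are hypothetical, chosen only for illustration.

```python
# (B, Std. Error, reported t) for selected Model 2 rows of the Coefficients table
model2 = {
    "LOCATION": (-183.474, 85.297, -2.151),
    "FSIZE": (-256.851, 34.569, -7.430),
    "AGE": (18.701, 11.355, 1.647),
}


def t_statistic(b, std_error):
    """t value for a coefficient: the estimate divided by its standard error."""
    return b / std_error


def predicted_change(b, delta_x):
    """Expected change in the DV for a change of delta_x in one predictor,
    holding the other predictors constant."""
    return b * delta_x


for name, (b, se, reported_t) in model2.items():
    print(f"{name}: t = {t_statistic(b, se):.3f} (reported {reported_t})")

# Hypothetical scenario: a household with two more members
change_two_members = predicted_change(model2["FSIZE"][0], 2)
print(f"Expected change in Savings: {change_two_members:.3f}")
```

The same division does not reproduce the reported t for INCOME, because its B and standard error are shown rounded to three decimals in the table.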
