Data Preparation & Analysis with SPSS
Data Preparation Process
Check Questionnaire
Edit
Code
Transcribe
Clean Data
Data Analysis
Questionnaire Checking
A questionnaire returned from the field may be
unacceptable for several reasons.
Editing
Editing the questionnaires involves identifying illegible, incomplete,
inconsistent or ambiguous responses.
Treatment of Unsatisfactory Results
– Returning to the Field
– Assigning Missing Values
– Discarding Unsatisfactory Respondents
Coding
Coding means assigning a code, usually a number, to each possible
response to each question. The code includes an indication of
– the column position (field), which holds a single item of data, e.g. the sex of a respondent
– the data record, which groups related fields such as sex, marital status,
age, income etc.
Coding Questions
• Fixed field codes, which mean that the number of records for each
respondent is the same and the same data appear in the same
column(s) for all respondents, are highly desirable.
• If possible, standard codes should be used for missing data. Coding of
structured questions is relatively simple, since the response options are
predetermined.
• In questions that permit a large number of responses, each possible
response option should be assigned a separate column.
Coding
Guidelines for coding unstructured questions:
• Category codes should be mutually exclusive and collectively
exhaustive.
• Only a few (10% or less) of the responses should fall into the “other”
category.
• Category codes should be assigned for critical issues even if no one
has mentioned them.
• Data should be coded to retain as much detail as possible.
Coding: Codebook
A codebook contains coding instructions and the necessary
information about variables in the data set. A codebook generally
contains the following information:
• column number
• record number
• variable number
• variable name
• question number
• instructions for coding
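A codebook can be pictured as a small lookup structure. The sketch below is illustrative only: the variable names, column numbers, code values and the missing-data code 9 are assumptions, not taken from any particular questionnaire.

```python
# Hypothetical codebook: variable names, columns and codes are illustrative only.
MISSING = 9  # assumed standard code for missing data

CODEBOOK = {
    "sex": {"column": 1, "codes": {"male": 1, "female": 2}},
    "marital_status": {"column": 2, "codes": {"single": 1, "married": 2, "other": 3}},
}


def code_response(variable, raw_answer):
    """Return the numeric code for a raw answer, or MISSING if it cannot be coded."""
    if raw_answer is None:
        return MISSING
    codes = CODEBOOK[variable]["codes"]
    return codes.get(raw_answer.strip().lower(), MISSING)


def code_record(raw_record):
    """Build a fixed-field record: one code per variable, ordered by column."""
    ordered = sorted(CODEBOOK, key=lambda name: CODEBOOK[name]["column"])
    return [code_response(name, raw_record.get(name)) for name in ordered]
```

Because every respondent is passed through the same column order, the same data always appear in the same position, which is the fixed-field property described earlier.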
Data Cleaning
Consistency Checks
Consistency checks identify data that are out of range, logically
inconsistent, or have extreme values.
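The three kinds of check can be sketched as follows. The field names (rating, owns_car, cars_owned, age), the 7-point range and the age cut-off are hypothetical and would need to match the actual questionnaire.

```python
# Hypothetical field names and limits; adjust to the actual questionnaire.
RATING_RANGE = (1, 7)     # assumed 7-point rating scale
MAX_PLAUSIBLE_AGE = 100   # assumed cut-off for flagging extreme values


def consistency_problems(record):
    """Return a list of problems found in one coded respondent record."""
    problems = []
    low, high = RATING_RANGE
    # Out-of-range check: the code must be one of the permitted values.
    if not low <= record["rating"] <= high:
        problems.append("rating out of range")
    # Logical check: a non-owner cannot report owning one or more cars.
    if record["owns_car"] == 0 and record["cars_owned"] > 0:
        problems.append("logically inconsistent car ownership")
    # Extreme-value check: flag the record for review rather than delete it.
    if record["age"] > MAX_PLAUSIBLE_AGE:
        problems.append("extreme age")
    return problems
```

A record that returns an empty list passes all three checks; anything else should be traced back to the original questionnaire.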
Selecting a Data Analysis Strategy
Earlier Steps of the Research Process
Known Characteristics of the Data
Properties of Statistical Techniques
Background and Philosophy of the Researcher
Data Analysis Strategy
• Metric Data: Data that are on an interval or ratio scale.
• Non-metric Data: Data that are on a nominal or ordinal scale.
• Univariate Techniques: Statistical techniques appropriate for analysing
data when there is a single measurement of each element in the sample.
• Multivariate Techniques: Statistical techniques appropriate for analysing
data when there are two or more measurements on each element in the
sample. They examine simultaneous relationships among two or more
phenomena.
– Dependence Techniques: Used when one or more of the variables can be
identified as dependent variables & the remaining as independent
variables.
– Interdependence Techniques: Techniques that attempt to group
data based on underlying similarity. No distinction is made as to which
variables are dependent or independent.
A Classification of Univariate Techniques
Univariate Techniques
• Metric Data
– One Sample: t test, Z test
– Two or More Samples, Independent: Two-Group t test, Z test, One-Way ANOVA
– Two or More Samples, Related: Paired t test
• Non-numeric (Non-metric) Data
– One Sample: Frequency, Chi-Square, K-S, Runs, Binomial
– Two or More Samples, Independent: Chi-Square, Mann-Whitney, Median, K-S, K-W ANOVA
– Two or More Samples, Related: Sign, Wilcoxon, McNemar, Chi-Square
A Classification of Multivariate Techniques
Multivariate Techniques
• Dependence Technique
– One Dependent Variable: Cross-Tabulation, Analysis of Variance and Covariance, Multiple Regression, Conjoint Analysis
– More Than One Dependent Variable: Multivariate Analysis of Variance and Covariance, Canonical Correlation, Multiple Discriminant Analysis
• Interdependence Technique
– Variable Interdependence: Factor Analysis
– Interobject Similarity: Cluster Analysis, Multidimensional Scaling
Type I Error & Type II Error
• A Type I error (α) is the mistake of rejecting the null hypothesis when it is true.
• A Type II error (β) is the mistake of failing to reject the null hypothesis when it is false.
             Ho (True)                  Ho (False)
Accept Ho    Correct Decision (1- α)    Type II Error (β)
Reject Ho    Type I Error (α)           Correct Decision (1- β)
Example, with Ho: the filling machine is working accurately.
Type II error: The machine is working erroneously, but it is assumed
to be working accurately; hence it fills wrongly, causing loss to the
company & customers.
Type I error: The machine is working accurately, but it is assumed to
be working erroneously; hence filling is stopped & a mechanic is called.
23 April 2018
Hypotheses Testing
• Level of Significance (α): The risk a researcher is willing to take of rejecting the null hypothesis when it
is true. It is the probability of making a Type I error. The higher the significance level, the higher
the probability of rejecting a null hypothesis when it is true.
• Critical Region: The rejection region. If the value of the test statistic (for example, the sample mean) falls
within this region, the null hypothesis is rejected.
• Critical Value: The value of a test statistic beyond which the null hypothesis can be rejected.
• Power of Test (1- β): The ability of a test to reject a false null hypothesis, i.e. the probability of supporting
an alternative hypothesis that is true. A high value of 1- β (near 1) means the test is working well: it rejects
the null hypothesis when it is false.
• One-Tailed Test: The null hypothesis is rejected only for values of the test statistic falling into one specified
tail of its sampling distribution.
• Two-Tailed Test: The null hypothesis is rejected for values of the test statistic falling into either tail of its
sampling distribution. A sufficiently large deviation in either direction leads to rejection. Normally α is split
into α/2 in each tail.
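The link between α, the critical value and power can be made concrete for a Z test. The sketch below uses Python's standard library rather than SPSS; the function names are hypothetical, and `shift` is assumed to be the distance between the true mean and the null mean, measured in standard errors.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def critical_z(alpha, tails):
    """Positive critical value of Z; alpha is split across the tails."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    return STANDARD_NORMAL.inv_cdf(1 - alpha / tails)


def power_upper_tail(alpha, shift):
    """Power (1 - beta) of an upper one-tailed Z test.

    shift = (true mean - null mean) / standard error.
    """
    return 1 - STANDARD_NORMAL.cdf(critical_z(alpha, 1) - shift)
```

With α = 0.05 the two-tailed critical value is about 1.96 and the one-tailed value about 1.645. When `shift` is 0 the null hypothesis is actually true, so the "power" equals α, the Type I error rate.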
One Tailed & Two Tailed Test
• A light bulb manufacturer wants to produce bulbs with a mean life of 1000
hours. If the lifetime is shorter, he will lose customers to competitors; if the
lifetime is longer, he will incur a very high production cost because the filaments will
be very thick. Determine the type of test.
• A wholesaler buys bulbs in large lots & does not want to accept bulbs unless
their mean life is at least 1000 hours. Determine the type of test.
One Tailed Test
Two Tailed Test
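One reading of the two bulb cases is that the manufacturer cares about deviations in both directions (two-tailed), while the wholesaler only cares about a mean life below 1000 hours (lower one-tailed). The sketch below encodes that reading as a Z-test decision rule; the function names and the assumption of a known population standard deviation are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(sample_mean, mu0, sigma, n):
    """Z statistic for a sample mean, assuming a known population sigma."""
    return (sample_mean - mu0) / (sigma / sqrt(n))


def reject_null(z, alpha, alternative):
    """Decide whether to reject Ho for the given alternative hypothesis."""
    std = NormalDist()
    if alternative == "two-sided":   # manufacturer: mean life differs from 1000
        return abs(z) > std.inv_cdf(1 - alpha / 2)
    if alternative == "less":        # wholesaler: mean life is below 1000
        return z < std.inv_cdf(alpha)
    if alternative == "greater":
        return z > std.inv_cdf(1 - alpha)
    raise ValueError("alternative must be 'two-sided', 'less' or 'greater'")
```

For example, `reject_null(z_statistic(980, 1000, 80, 64), 0.05, "less")` applies the wholesaler's test to a hypothetical sample of 64 bulbs with a mean life of 980 hours.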
Univariate Data Analysis
t-tests (Cases)
One-sample t-test: To test whether the mean of a distribution differs significantly
from some preset value.
For the given [Link] file, find whether the final marks scored by students differ
significantly from the Professor’s goal of a class average of 60. Formulate the
hypotheses & test them.
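Outside SPSS, the one-sample t statistic can be computed by hand as a check. This is a minimal sketch using only Python's standard library; the function name is hypothetical and any marks passed to it here are made-up values, not the contents of the data file.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(values, mu0):
    """Return (t, degrees of freedom) for Ho: population mean = mu0."""
    n = len(values)
    standard_error = stdev(values) / sqrt(n)  # stdev uses the n - 1 divisor
    t = (mean(values) - mu0) / standard_error
    return t, n - 1
```

Calling `one_sample_t(final_marks, 60)` on the list of final marks gives the t value to compare with the critical value for n - 1 degrees of freedom, or with the Sig. (2-tailed) value that SPSS reports.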
t-tests (Cases)
Independent-samples t-test: To test whether the means of two independent samples
differ significantly from each other.
Suppose there are 15 customers of our brand in each of Mumbai & Delhi, and they
are asked to rate our brand on a 7-point scale, where 1 = most disliked & 7 = most
liked.
The ratings by these 30 customers from the two cities are mentioned next.
Develop a hypothesis to test whether the ratings from the two cities differ. Also test
the hypothesis.
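The pooled-variance form of the independent-samples t statistic can be sketched as below. It assumes equal variances in the two cities (SPSS also reports a version that does not), and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(sample_a, sample_b):
    """Return (t, degrees of freedom) for Ho: the two population means are equal.

    Uses the pooled-variance formula, which assumes equal variances.
    """
    na, nb = len(sample_a), len(sample_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)) / df
    standard_error = sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df
```

With the 15 Mumbai ratings and 15 Delhi ratings as the two lists, the degrees of freedom would be 28.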
t-tests (Cases)
Paired-samples t-test: To test whether two measurements on the same sample differ
significantly.
Suppose there are 18 customers of the Passion brand of garments. This set of
customers is to be monitored for their attitude towards the Passion brand
before and after the release of an advertising campaign. The attitude is to be
measured on a 10-point scale, where 1 = highly disliked & 10 = highly liked.
The ratings by these 18 customers before and after the advertising
campaign are mentioned next. Develop a hypothesis to test whether these ratings
by customers differ. Also test the hypothesis.
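A paired test works on the before/after difference for each customer. The following is a minimal sketch with a hypothetical function name; the two lists must be in the same customer order.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """Return (t, degrees of freedom) for Ho: mean of (after - before) is zero."""
    if len(before) != len(after):
        raise ValueError("before and after must have one rating per customer")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1
```

For the 18 Passion customers the test would have 17 degrees of freedom; a positive t indicates that ratings rose after the campaign.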
ANOVA
• Whereas t-tests compare only two distributions, analysis of variance can
compare many. E.g. in the MARKS file, if we want to see whether the Quiz1 scores of
men and women differ, i.e. who (men or women) scores higher in the quiz, a t-test
is appropriate.
If, however, we wish to see whether the scores of any of the five ethnic groups
differ significantly from each other on the same quiz, one-way analysis
of variance is required.
One-way ANOVA means:
Exactly one dependent variable (continuous), e.g. quiz1 scores here
Exactly one independent variable (categorical), e.g. ethnicity here, with 5 levels
Two-way (three-way) ANOVA means: Exactly one dependent variable &
exactly two (three) independent variables
MANOVA: Multiple dependent variables & one or more independent variables
One-Way ANOVA
• File # [Link]
Dependent variable – Quiz 1 scores
Independent variable – Ethnicity (with 5 levels)
– Ho: There is no difference among students of different
ethnicities in the quiz1 marks they scored.
– H1: There is a significant difference among students of
different ethnicities in the quiz1 marks they scored.
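The F ratio that SPSS prints in the one-way ANOVA table is the between-groups mean square divided by the within-groups mean square. A minimal sketch, with a hypothetical function name and one list of scores per group:

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within
```

With five ethnic groups, `df_between` is 4; a large F relative to the critical value leads to rejecting Ho.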
Chi-Square Test
The graduation background of MBA students & their performance in terms of
grade are given below:
Education Background:
• [Link] (1)
• B.E. (2)
• [Link]. (3)
• B.B.A. (4)
• B.A. (5)
Grade Codes:
• A (1)
• B (2)
• C (3)
Ho: Graduation background of MBA students does not
influence their performance in terms of grade.
Ha: Graduation background of MBA students influences their
performance in terms of grade.
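The chi-square statistic for a cross-tabulation compares the observed counts with the counts expected if background and grade were independent. A minimal sketch, with a hypothetical function name; `observed` is a list of rows (backgrounds), each holding the counts per grade:

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df
```

For five education backgrounds and three grades, the table has (5 - 1) × (3 - 1) = 8 degrees of freedom.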
Correlation (r)
• Degree of association between two sets of quantitative data, e.g. how is crop
production correlated with rainfall?
• r varies from -1 to +1; r = 0 (no linear correlation); r = ±1 (perfect correlation)
Bivariate Correlation: Correlation between two variables
• File # [Link]
• To produce a correlation matrix of gender, gpa & final
Partial Correlation: Process of finding the correlation between two variables after
the influence of other variables has been controlled for.
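Both coefficients can be written out directly. The sketch below computes Pearson's r from raw data and a first-order partial correlation (controlling for a single third variable z) from the three pairwise correlations; the function names are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_r(r_xy, r_xz, r_yz):
    """Correlation between x and y after controlling for one variable z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

If z is unrelated to both x and y (both of its correlations are 0), the partial correlation equals the ordinary correlation.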
Regression
• Regression explains variation in one variable (dependent variable) based on the
variation in one or more other variables (independent variables)
• Simple regression: one dependent & one independent variable
• Multiple regression: one dependent & more than one independent variables
• File # [Link]
It is desired to study the effect that six different conditions (independent variables)
have on the yield per hectare for a crop of wheat. The research was conducted by
accumulating data from fifteen major states in India.
The six independent variables are:
X1= Rainfall (in cms)
X2= Soil type (1, low quality to 5, high quality)
X3= Quantity of fertilizer (in quintal/ sq. km of land)
X4= Land percentage being irrigated by State Agri. Deptt.
X5= Seed quality (1, low quality to 5, high quality)
X6= Percentage of automation in cultivation process
Dependent variable is Y= yield per hectare in quintals
Regression
We need to determine:
1. Is the model a good fit? From the ANOVA table (F-value)
2. What % of variation in the dependent variable is explained by the independent variables?
From Model Summary (Adjusted R square)
3. Which independent variables are good explanatory variables of the dependent variable?
From Coefficients (t-values)
4. What is the regression equation? From Coefficients (B values)
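The quantities in steps 1, 2 and 4 are related by simple formulas. The sketch below shows how adjusted R square and the overall F value follow from R square, the number of cases n and the number of predictors k, and how a fitted equation is used for prediction. The function names are hypothetical, and any R square or coefficient values passed in are placeholders, not results from the wheat data.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R square for n cases and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2, n, k):
    """Overall F value with (k, n - k - 1) degrees of freedom."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


def predict(intercept, coefficients, values):
    """Evaluate Y = b0 + b1*X1 + ... + bk*Xk for one case."""
    if len(coefficients) != len(values):
        raise ValueError("need one value per coefficient")
    return intercept + sum(b * x for b, x in zip(coefficients, values))
```

For the wheat study n = 15 and k = 6, so the residual degrees of freedom are only 8; with so few cases per predictor, adjusted R square can fall well below R square.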