
Data Analysis Lec 4

The document discusses crosstabulations, chi-square tests, and various graph types used for data analysis, emphasizing their application in exploring relationships between categorical variables. It also addresses the identification and management of outliers in datasets, including methods for detection and potential solutions for handling them. Key statistical concepts and visualizations are highlighted to aid in research and analysis.

Uploaded by

Victor Mosioma
Copyright
© All Rights Reserved

Crosstabs

• Crosstabulations display the joint distribution of two or more categorical
variables
• Crosstabulations are commonly used to explore how demographic
characteristics are related to attitudes and behaviors.
• In general, crosstabs are used to study the relationships between two or more
categorical variables
• Crosstab assumptions: variables used in crosstabs must be categorical
(nominal or ordinal)
Generating crosstabs
• Select variables for the crosstabs procedure, at least one for the row and one
for the column dimension.
• Select percentage options
• Review the procedure output to investigate the relationship between the
variables including cell counts and percentages.
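The steps above can be sketched outside SPSS as well. The following is a minimal illustration in plain Python, not the SPSS procedure itself: the function names `crosstab` and `row_percentages` and the survey data are hypothetical, chosen only to show cell counts and row percentages for one row variable and one column variable.

```python
from collections import Counter


def crosstab(rows, cols):
    """Count how often each (row level, column level) pair occurs."""
    if len(rows) != len(cols):
        raise ValueError("rows and cols must have the same length")
    counts = Counter(zip(rows, cols))
    row_levels = sorted(set(rows))
    col_levels = sorted(set(cols))
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table


def row_percentages(table):
    """Express each cell as a percentage of its row total."""
    result = []
    for row in table:
        total = sum(row)
        result.append([100.0 * n / total if total else 0.0 for n in row])
    return result


# Hypothetical survey data: gender is the row variable, opinion the column variable.
gender = ["F", "F", "F", "F", "M", "M", "M", "M", "M", "F"]
opinion = ["agree", "agree", "disagree", "agree", "disagree",
           "disagree", "agree", "disagree", "disagree", "agree"]

row_levels, col_levels, table = crosstab(gender, opinion)
percents = row_percentages(table)

# Print counts with row percentages, one line per row level.
print("gender  " + "".join(f"{c:>16}" for c in col_levels))
for level, counts, pcts in zip(row_levels, table, percents):
    cells = "".join(f"{n:>8} ({p:4.1f}%)" for n, p in zip(counts, pcts))
    print(f"{level:<8}{cells}")
```

Row percentages are used here because the row variable is treated as the grouping variable; column or total percentages would follow the same pattern with a different denominator.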
Chi-square output
                               Value      df   Asymp. Sig. (2-sided)
Pearson Chi-square             150.134a    8   .000
Likelihood ratio               3.049       3   .000
Linear-by-linear association   117.739     1   .000
N of valid cases               10

• In terms of formal testing, we often use significance levels of .05 or .01 to
determine whether we should reject the null hypothesis or not
Chi-square test
• The chi-square test starts from a null hypothesis about the relationship
between the two variables being tested. The null hypothesis normally states
that there is no relationship between the variables
• The significance value helps us decide whether to reject the null hypothesis
or fail to reject it
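To make the test concrete, here is a minimal sketch of the uncorrected Pearson chi-square statistic computed by hand in Python. The function names and the 2x2 counts are hypothetical, and the p-value helper covers only the 1-degree-of-freedom case, where the chi-square tail probability has a closed form; larger tables would need a statistics library.

```python
import math


def pearson_chi_square(table):
    """Return (statistic, df) for a contingency table given as rows of counts.

    Assumes every row and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were unrelated (null hypothesis).
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


def p_value_df1(statistic):
    """Upper-tail p-value for a chi-square statistic with exactly 1 df."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical 2x2 table of observed counts.
observed = [[30, 10], [15, 25]]

statistic, df = pearson_chi_square(observed)
p_value = p_value_df1(statistic)
reject_null = p_value < 0.05  # .05 significance level

print(f"chi-square = {statistic:.3f}, df = {df}, p = {p_value:.4f}")
print("reject the null hypothesis" if reject_null else "fail to reject the null hypothesis")
```

For these counts the statistic is about 11.43 with 1 degree of freedom, so the null hypothesis of no relationship is rejected at the .05 level.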
Graphs
• Graphs are used to visualize data distributions, identify trends, compare categories
and explore relationships between variables for research, analysis and presentations.
The following graphs can be generated in SPSS to visualize data:
Ø Bar chart
Ø Line chart
Ø Pie/polar chart
Ø Histogram
Ø Box plot
Ø Scatter charts
Bar Chart
• Most appropriate for visualizing, comparing and analyzing the distribution of
a categorical variable (nominal or ordinal). It can be used to show
frequencies or percentages for different levels of a variable
• Bar charts are best suited for data where categories are discrete rather than
continuous, using gaps between bars to emphasize this distinction
Pie chart
• Most appropriate in visualization of the proportional contribution of a small
number of categories (2-5) to a whole, using nominal or ordinal data to
display counts or percentages
• They are best suited for situations where you want to immediately show
part-to-whole relationships such as market share, survey response proportions
or demographic breakdowns
Histogram
• Most appropriate in visualizing the distribution, shape, spread and center of
a single continuous (interval or ratio) numerical variable.
• It is best applied when you need to inspect data for normality, identify
outliers (extreme values) or determine if the data is skewed
Line chart
• Most appropriate in visualization of trends in continuous data over time
(time-series data) or across ordered categories.
• Line graphs are superior to bar charts when identifying small, incremental
changes, or when comparing trends across multiple groups
Box Plot
• Ideal graph for visualizing the distribution of a continuous variable,
specifically for comparing medians, variability, skewness and identifying
outliers across multiple groups.
• It is a powerful exploratory data analysis tool, used before conducting
inferential tests such as ANOVA or t-tests to check the data for skewness and
outliers that would cast doubt on normality
Scatter charts
• A scatter plot is most appropriate for examining the relationship, pattern or
association between two continuous quantitative variables.
• It helps to determine if a change in one variable (independent) is associated
with a change in another (dependent)
Clustered bar chart
• Can be used to give a graphical view of the relationship between two
categorical variables
• Select a clustered bar chart in the chart builder and drag it into the chart
preview area
• Select a variable for x-axis.
• Select a variable for the "Cluster on X: set color" box
Means procedure
• The Means procedure is used to analyze continuous variables. It summarizes a
continuous dependent variable within the groups defined by one or more
categorical independent variables.
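As a rough illustration of the kind of summary this produces, the sketch below groups a continuous variable by a categorical one and reports the count, mean and standard deviation per group. It is plain Python rather than SPSS output, and the function name `means_by_group` and the region and income data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def means_by_group(groups, values):
    """Summarize a continuous variable within each level of a categorical variable."""
    by_group = defaultdict(list)
    for group, value in zip(groups, values):
        by_group[group].append(value)
    summary = {}
    for group, vals in sorted(by_group.items()):
        summary[group] = {
            "n": len(vals),
            "mean": mean(vals),
            # The sample standard deviation needs at least two observations.
            "sd": stdev(vals) if len(vals) > 1 else float("nan"),
        }
    return summary


# Hypothetical data: region is the categorical independent variable,
# income (in thousands) is the continuous dependent variable.
region = ["north", "north", "north", "south", "south", "south"]
income = [40.0, 50.0, 60.0, 20.0, 30.0, 40.0]

summary = means_by_group(region, income)
for group, stats in summary.items():
    print(f"{group:<6} n={stats['n']}  mean={stats['mean']:.1f}  sd={stats['sd']:.1f}")
```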
Outliers
• Outliers are data points that differ significantly from other observations in a
dataset, indicating extreme values or potential errors
• Outliers (extreme values) can be a danger for the analysis because they
directly affect the mean and standard deviation. That is why we should detect,
and in some cases remove, the outliers before running tests that rely on
these statistics
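A small numerical sketch shows how strongly a single extreme value can move the mean and standard deviation while leaving the median unchanged. The scores below are hypothetical and the helper `describe` is introduced only for this illustration.

```python
from statistics import mean, median, stdev


def describe(values):
    """Return the mean, sample standard deviation and median of a data series."""
    return {"mean": mean(values), "sd": stdev(values), "median": median(values)}


# Hypothetical scores, then the same scores with one extreme value appended.
scores = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12]
scores_with_outlier = scores + [40]

before = describe(scores)
after = describe(scores_with_outlier)

for label, stats in (("without outlier", before), ("with outlier", after)):
    print(f"{label:<16} mean={stats['mean']:.2f}  sd={stats['sd']:.2f}  "
          f"median={stats['median']}")
```

In this example the standard deviation grows several times over once the extreme value is added, which is why tests built on the mean and standard deviation are sensitive to outliers.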
How to identify outliers
There are two methods for identifying the outliers:
1. A numerical method, based on the standardized values
2. A graphical method, based on the boxplot chart
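Both methods can be sketched numerically. The code below is an illustration under stated assumptions, not the SPSS implementation: the cutoff of 3 for standardized values and the 1.5 x IQR fences of the boxplot are common conventions, the quartiles come from Python's `statistics.quantiles` (which may differ slightly from the hinges SPSS draws), and the function names and scores are hypothetical.

```python
from statistics import mean, quantiles, stdev


def zscore_outliers(values, threshold=3.0):
    """Numerical method: flag values whose standardized value exceeds the threshold.

    Assumes the values are not all identical (non-zero standard deviation).
    """
    m = mean(values)
    s = stdev(values)
    return [v for v in values if abs((v - m) / s) > threshold]


def boxplot_outliers(values, k=1.5):
    """Boxplot rule: flag values more than k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical scores containing one extreme value.
scores = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 40]

z_flags = zscore_outliers(scores)
box_flags = boxplot_outliers(scores)
print("z-score method:", z_flags)
print("boxplot method:", box_flags)
```

With very small samples the standardized-value method can miss outliers, because the largest possible z-score is bounded by the sample size; the boxplot rule does not have that limitation.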
How to manage outliers
There are three kinds of outliers, depending on their source:

• Data entry errors, due to lack of attention, negligence, tiredness etc.

• Measurement or data collection errors, due either to human mistakes or to equipment
malfunction.

• Real, atypical values in your population. These are the so-called genuine
outliers
How to manage outliers
There are two basic solutions for dealing with genuine outliers:

• Remove the outliers from the data series


• Keep the extreme values in the data series
How to manage outliers
If we decide to keep the outliers, we have four possible routes to choose
from:

1. Run a nonparametric test, because these tests are less sensitive to outliers.

2. Replace the outliers with values closer to the typical range. Let’s suppose that
our data series looks like this:
2.7 2.2 5.9 3.4 3.0 2.8
How to manage outliers (contd.)

3. Run the parametric test regardless, being aware of the possible effects of the
outliers.

4. Perform a so-called sensitivity analysis: run both the parametric and the
nonparametric test. If the results are similar, we can conclude that the
outliers do not affect our findings.
