Data Analysis Lec 4

The document discusses crosstabulations, chi-square tests, and various graph types used for data analysis, emphasizing their application in exploring relationships between categorical variables. It also addresses the identification and management of outliers in datasets, including methods for detection and potential solutions for handling them. Key statistical concepts and visualizations are highlighted to aid in research and analysis.

Uploaded by

Victor Mosioma

Crosstabs

• Crosstabulations display the joint distribution of two or more categorical
variables
• Crosstabulations are commonly used to explore how demographic
characteristics are related to attitudes and behaviors.
• In general crosstabs are used to study the relationships between two or more
categorical variables
• Crosstab assumptions: variables used in crosstabs must be categorical
(nominal or ordinal)
Generating crosstabs
• Select variables for the crosstabs procedure, at least one for the row and one
for the column dimension.
• Select percentage options
• Review the procedure output to investigate the relationship between the
variables including cell counts and percentages.
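The steps above can also be sketched outside SPSS. The following is a minimal illustration in plain Python, not the SPSS procedure itself: the survey records, the variable names and the helper functions `crosstab` and `row_percentages` are all invented for this example. It builds the cell counts and then the row percentages.

```python
from collections import Counter

# Hypothetical survey records as (gender, opinion) pairs; invented for illustration.
records = [
    ("Female", "Agree"), ("Female", "Agree"), ("Female", "Disagree"),
    ("Male", "Agree"), ("Male", "Disagree"), ("Male", "Disagree"),
    ("Female", "Agree"), ("Male", "Disagree"),
]


def crosstab(pairs):
    """Count how often each (row category, column category) pair occurs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    table = {r: {c: counts[(r, c)] for c in cols} for r in rows}
    return rows, cols, table


def row_percentages(table):
    """Express each cell as a percentage of its row total."""
    result = {}
    for row_label, cells in table.items():
        total = sum(cells.values())
        result[row_label] = {c: 100.0 * n / total for c, n in cells.items()}
    return result


rows, cols, table = crosstab(records)
pct = row_percentages(table)

# Print counts with row percentages, one line per row category.
for r in rows:
    cells = ", ".join(f"{c}: {table[r][c]} ({pct[r][c]:.1f}%)" for c in cols)
    print(f"{r} -> {cells}")
```

Row percentages answer "within each row group, how are the column categories distributed?", which is usually the comparison of interest when the row variable is a demographic characteristic.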
Chi-square output
                               Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square             150.134a   8    .000
Likelihood Ratio               3.049      3    .000
Linear-by-Linear Association   117.739    1    .000
N of Valid Cases               10

• In formal testing, we often use a significance level of .05 or .01 to decide
whether to reject the null hypothesis: we reject it when the significance
value in the output is below the chosen level
Chi-square test
• The chi-square test starts from a null hypothesis about the relationship
between the two variables being tested. The null hypothesis normally
states that there is no relationship between the variables
• The significance value helps us decide whether to reject or fail to reject
the null hypothesis
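To make the test concrete, here is a minimal sketch of the Pearson chi-square calculation for a 2x2 table, written in plain Python. The observed counts and the function name `chi_square_2x2` are hypothetical, and the p-value formula used is valid only for 1 degree of freedom (larger tables need the general chi-square distribution, which SPSS computes for you).

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square for a 2x2 table of counts (no continuity correction)."""
    if len(table) != 2 or any(len(row) != 2 for row in table):
        raise ValueError("this sketch only handles 2x2 tables")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = 1  # (rows - 1) * (columns - 1) for a 2x2 table
    # For df = 1 the upper-tail probability equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, df, p_value


# Hypothetical observed counts: rows are groups, columns are responses.
observed = [[30, 10], [15, 25]]
stat, df, p_value = chi_square_2x2(observed)

alpha = 0.05  # chosen significance level
reject_null = p_value < alpha
print(f"chi-square = {stat:.3f}, df = {df}, p = {p_value:.4f}, reject H0: {reject_null}")
```

The decision rule is the comparison on the `reject_null` line: a significance value below the chosen level leads to rejecting the null hypothesis of no relationship.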
Graphs
• Graphs are used to visualize data distributions, identify trends, compare categories
and explore relationships between variables for research, analysis and presentations.
The following graphs can be generated in SPSS to visualize data:
➢ Bar chart
➢ Line chart
➢ Pie/polar chart
➢ Histogram
➢ Box plot
➢ Scatter charts
Bar Chart
• Most appropriate for visualizing, comparing and analyzing the distribution of
a categorical variable (nominal or ordinal). It can be used to show
frequencies or percentages for different levels of a variable
• Bar charts are best suited for data where categories are discrete rather than
continuous, using gaps between bars to emphasize this distinction
Pie chart
• Most appropriate in visualization of the proportional contribution of a small
number of categories (2-5) to a whole, using nominal or ordinal data to
display counts or percentages
• They are best suited for situations where you want to immediately show
part-to-whole relationships such as market share, survey response proportions
or demographic breakdowns
Histogram
• Most appropriate for visualizing the distribution, shape, spread and center of
a single continuous (interval or ratio) numerical variable.
• It is best applied when you need to inspect data for normality, identify
outliers (extreme values) or determine if the data is skewed
Line chart
• Most appropriate in visualization of trends in continuous data over time
(time-series data) or across ordered categories.
• Line graphs are superior to bar charts when identifying small, incremental
changes, or when comparing trends across multiple groups
Box Plot
• The ideal graph for visualizing the distribution of a continuous variable,
specifically for comparing medians, variability, skewness and identifying
outliers across multiple groups.
• It is a powerful exploratory data analysis tool used before conducting
inferential tests like ANOVA or t-tests to check whether the data look
roughly symmetric and free of extreme values
Scatter charts
• A scatter plot is most appropriate for examining the relationship, pattern or
association between two continuous quantitative variables.
• It helps to determine if a change in one variable (independent) is associated
with a change in another (dependent)
Clustered bar chart
• Can be used to give a graphical view of the relationship between two
categorical variables
• Select a clustered bar chart in the chart builder and drag it into the chart
preview area
• Select a variable for x-axis.
• Select a variable for the "Cluster on X: set color" box
Means procedure
• The Means procedure summarizes a continuous dependent variable (for example
its mean and standard deviation) within the groups defined by one or more
categorical independent variables.
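As a rough illustration of what such a grouped summary contains, the sketch below computes the count, mean and standard deviation of a continuous variable within each category. It is plain Python rather than SPSS, and the region labels, scores and the helper `group_summary` are invented for the example.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical observations: (categorical group, continuous score).
observations = [
    ("North", 52.0), ("North", 48.0), ("North", 50.0),
    ("South", 61.0), ("South", 59.0), ("South", 63.0),
]


def group_summary(pairs):
    """Return n, mean and sample standard deviation for each group."""
    groups = defaultdict(list)
    for label, value in pairs:
        groups[label].append(value)
    return {
        label: {"n": len(values), "mean": mean(values), "sd": stdev(values)}
        for label, values in sorted(groups.items())
    }


summary = group_summary(observations)
for label, stats in summary.items():
    print(f"{label}: n={stats['n']}, mean={stats['mean']:.2f}, sd={stats['sd']:.2f}")
```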
Outliers
• Outliers are data points that differ significantly from other observations in a
dataset, indicating extreme values or potential errors
• Outliers (extreme values) can endanger the analysis because they directly
affect the mean and the standard deviation. That is why we should detect,
and in some cases remove, the outliers before running tests that rely on
those statistics
How to identify outliers
There are two methods for identifying the outliers:
1. A numerical method, based on the standardized values
2. A graphical method, based on the boxplot chart
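Both methods can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the SPSS implementation: the sample values are invented, the cut-off |z| > 3 is a common convention rather than something fixed by these notes, and the boxplot rule is assumed to be the usual 1.5 x spread fences around Tukey's hinges.

```python
from statistics import mean, median, stdev

# Hypothetical sample; invented for illustration.
sample = [12, 14, 15, 15, 16, 16, 17, 17, 18, 18, 19, 45]


def z_score_outliers(values, threshold=3.0):
    """Numerical method: flag values whose standardized score exceeds the threshold."""
    m, s = mean(values), stdev(values)
    return [x for x in values if abs((x - m) / s) > threshold]


def tukey_fences(values, k=1.5):
    """Boxplot fences: hinges are the medians of the lower and upper halves."""
    ordered = sorted(values)
    half = (len(ordered) + 1) // 2  # the median joins both halves when n is odd
    lower_hinge = median(ordered[:half])
    upper_hinge = median(ordered[-half:])
    spread = upper_hinge - lower_hinge
    return lower_hinge - k * spread, upper_hinge + k * spread


def boxplot_outliers(values, k=1.5):
    """Graphical method in numbers: flag values outside the boxplot fences."""
    low, high = tukey_fences(values, k)
    return [x for x in values if x < low or x > high]


z_flagged = z_score_outliers(sample)
box_flagged = boxplot_outliers(sample)
print("z-score method flags:", z_flagged)
print("boxplot method flags:", box_flagged)
```

One caution on the numerical method: in a sample of n values no standardized score can exceed (n - 1) / sqrt(n), so with 10 or fewer observations a cut-off of 3 can never be reached and the boxplot rule is the more useful check.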
How to manage outliers
There are three kinds of outliers, depending on their source:

• Data entry errors, due to lack of attention, negligence, tiredness etc.

• Measurement or data collecting errors, due either to human mistakes or to equipment

malfunction.

• Real non-typical, unusual values in your population. These are the so called genuine
outliers
How to manage outliers
There are two basic solutions for dealing with genuine outliers:

• Remove the outliers from the data series

• Keep the extreme values in the data series
How to manage outliers
If we decide to keep the outliers, we have four possible routes to choose
from:
1. Run a nonparametric test, because these tests are less sensitive to outliers.
2. Replace the outliers with values closer to the rest of the data. Suppose
our data series looks like this:
2.7 2.2 5.9 3.4 3.0 2.8
Here 5.9 is the extreme value.
3. Run the parametric test regardless, being aware of the possible effects of the
outliers.
4. Perform a so-called sensitivity analysis: run both the parametric and the
nonparametric test. If the results are similar, we can conclude that the
outliers do not affect our findings.
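Route 2 can be illustrated with the data series quoted above. The notes do not say which replacement value to use, so the sketch below makes an assumption: each value outside the boxplot fences is replaced by the nearest value that lies inside them. The helper names `tukey_fences` and `replace_outliers` are hypothetical.

```python
from statistics import mean, median

# The data series quoted in the notes.
series = [2.7, 2.2, 5.9, 3.4, 3.0, 2.8]


def tukey_fences(values, k=1.5):
    """Boxplot fences based on the medians of the lower and upper halves."""
    ordered = sorted(values)
    half = (len(ordered) + 1) // 2  # the median joins both halves when n is odd
    lower_hinge = median(ordered[:half])
    upper_hinge = median(ordered[-half:])
    spread = upper_hinge - lower_hinge
    return lower_hinge - k * spread, upper_hinge + k * spread


def replace_outliers(values, k=1.5):
    """Pull each value outside the fences back to the nearest value inside them."""
    low, high = tukey_fences(values, k)
    inliers = [x for x in values if low <= x <= high]
    smallest, largest = min(inliers), max(inliers)
    return [min(max(x, smallest), largest) for x in values]


adjusted = replace_outliers(series)
print("original:", series, "mean =", round(mean(series), 3))
print("adjusted:", adjusted, "mean =", round(mean(adjusted), 3))
```

Under this assumption 5.9 is replaced by 3.4, the largest remaining value, and the mean moves closer to the bulk of the data. Whatever replacement rule is chosen, it should be reported alongside the results.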
