Descriptive Statistics Analysis Guide

Q: How do statistical tests like the chi-square test and McNemar test evaluate differences between groups?

The chi-square test evaluates differences between groups by comparing observed frequencies with the frequencies expected under the null hypothesis. It is used mainly on contingency tables to test the independence of categorical variables, and it is effective for comparing proportions across several groups. McNemar's test, on the other hand, is used for paired nominal data, typically before-and-after measurements on a single sample, to judge the effectiveness of an intervention. It assesses whether the marginal frequencies of two dependent samples differ significantly.

Q: Evaluate the utility of descriptive plots such as histograms and boxplots in understanding the characteristics of data distributions.

Descriptive plots like histograms and boxplots are invaluable in data analysis because they give visual insight into the data distribution. Histograms display frequency distributions, helping to identify skewness, modality, and outliers. Boxplots reveal spread, central tendency, and variability; they also show the quartiles through the box and flag potential outliers beyond the whiskers. These plots enable quick assessment of symmetry, outliers, and variance, which helps in choosing appropriate statistical methods and transformations.

Q: What is the main purpose of exploratory data analysis, and how does it differ from descriptive statistics?

Exploratory data analysis is primarily used to examine data for outliers, test assumptions such as normality and homogeneity of variance, and understand group differences. It provides insight into data characteristics through visual methods such as stem-and-leaf plots, histograms, and boxplots. In contrast, descriptive statistics focus on summarizing the data, for example by computing measures of central tendency and dispersion, without probing the underlying assumptions or differences in distribution.

Q: Explain how the concept of skewness and kurtosis is used to determine data normality and why this is important in statistical analysis.

Skewness and kurtosis measure the asymmetry and the peakedness of a data distribution, respectively. Skewness indicates whether the values are skewed to the left or right of the peak, while kurtosis reflects how heavy the tails of the distribution are. These measures are important for judging normality, which is a key assumption of many parametric tests. A normal distribution is symmetrical, with skewness of zero and excess kurtosis of zero. Checking both helps verify the assumption, so that tests are applied appropriately and results are reliable.

Q: Discuss the role of standardized values in descriptive statistical analysis and the implication of transforming variables using Z-scores.

Standardized values, or Z-scores, play a crucial role in descriptive analysis by transforming variables to have a mean of zero and a standard deviation of one. This transformation allows variables measured on different scales and in different units to be compared on a common scale. It is particularly useful when comparing scores from different tests or conditions, because it makes relative standing easier to interpret and helps to identify outliers in the dataset.

Q: Discuss the implications of the results of the chi-square test applied to stratified data concerning the relationship between smoking and lung cancer, and what this suggests about risk factors.

The chi-square test on stratified data concerning smoking and lung cancer revealed a significant association, highlighting smoking as a potential risk factor for lung cancer. Stratified analyses help to disentangle such relationships by accounting for confounding variables like gender. Significant results across the strata suggest that the risk factor is consistently present, reinforcing the impact of smoking on lung cancer across different demographic groups. This highlights the need for targeted interventions and supports the public health focus on smoking cessation to reduce lung cancer risk.

Q: How can data transformation and the choice of parametric or non-parametric methods be affected by the assumptions of normality and homogeneity of variance?

Data transformation and the choice between parametric and non-parametric methods are influenced by the assumptions of normality and homogeneity of variance. When data do not meet these assumptions, transformations such as the log or square root may normalize the data or reduce heteroscedasticity. If transformations are unsuccessful, non-parametric methods, which do not require these assumptions, become preferable. These adjustments keep the analysis valid by aligning the data with the prerequisites of the chosen method.

Q: How does the chi-square test determine the relationship between variables in different sample groups, and what statistical significance implies in this context?

The chi-square test determines the relationship between variables in different sample groups by calculating whether the observed frequency distribution of the categorical variables deviates significantly from what would be expected under independence. It assesses the association between the row and column variables of a contingency table. Statistical significance in this context means that the observed association or difference in proportions is unlikely to be due to chance, suggesting a potential relationship between the variables being studied.

Q: What is the significance of using bootstrap methods in statistical analysis, particularly in terms of calculating confidence intervals?

Bootstrap methods are significant in statistical analysis because they estimate the distribution of a statistic by resampling with replacement from the data. This is particularly useful for calculating confidence intervals when the underlying distribution is unknown or when traditional methods cannot be applied because of sample size constraints. The bootstrap provides more robust confidence intervals, either percentile-based or bias-corrected and accelerated (BCa), and so gives a more accurate picture of the variability in the data.

Q: In what scenarios would multiple comparison procedures be necessary after conducting a chi-square test, and how are these implemented?

Multiple comparison procedures are necessary after a chi-square test when more than two categories are involved and the overall test reveals significant differences. These procedures identify which specific groups differ from the others. They are implemented as pairwise comparisons or post-hoc tests, usually adjusted with a method such as the Bonferroni correction to account for the inflated Type I error rate caused by multiple testing. This protects the reliability and validity of conclusions about which groups differ significantly.

Chapter 4 discusses the analysis of descriptive statistics, detailing processes such as Frequencies, Descriptives, and Explore. It provides examples and commands for analyzing data distributions, calculating descriptive statistics, and exploring data characteristics, including outlier detection and normality tests. The chapter emphasizes the importance of understanding data distribution and variance for effective statistical analysis.

Uploaded by

Kamuya Beata


Chapter 4 Analysis of Descriptive Statistics

The Descriptive Statistics (basic statistical analysis) submenu of the Analyze menu includes seven procedures, among them Frequencies, Descriptives, Explore, Crosstabs, P-P Plots and Q-Q Plots (Figure 4-1).

Figure 4-1 Basic Statistical Analysis menu in Statistical Analysis Module

4.1 Frequencies

4.1.1 Description

Frequency distribution analysis describes the distribution of the data by means of frequency tables, bar charts and histograms, together with statistics of central tendency and dispersion.

4.1.2 Example 4-1

Example 4-1 uses the data file “diameter_sub.sav”, which records the sagittal diameter of 216 human spinal vertebrae. Produce a descriptive analysis of the variable “trueap_mean” (sagittal diameter) and draw a histogram.
4.1.3 Running the command

Analyze
Descriptive Statistics
Frequencies
The dialog box of “Frequencies” pops out (Figure 4-2).

Figure 4-2 The Frequencies dialog box

Display frequency tables.

★Statistics：Click “Statistics” button，and the dialog box of “Frequencies：
Statistics” pops out (Figure 4-3).
Figure 4-3 The Frequencies: Statistics dialog box
◇Percentile Values.
Quartiles.
□Cut points for: 10 equal groups: The observed values are divided into n equal groups by percentiles. The system default is 10, which outputs the values P10, P20, …, P90.
□Percentile (s): Select a custom percentile. The “Add” button is activated after
the value is filled in the text box.
◇Central Tendency.
Mean.
Median.
Mode.
□Sum.
◇Dispersion.
[Link]. Minimum.
Variance. Maximun.
Range. [Link].
□Values are group midpoints：When calculating the percentile, it is assumed
that the current data has been divided into different groups. The current data is
calculated as the group median of each group.
◇Distribution.
Skewness.
Kurtosis.
★Charts: Click “Charts” button，and the dialog box of “Frequencies: Charts”
pops out (Figure 4-4).

Figure 4-4 The Frequencies: Charts dialog box

◇Chart Type.
◎None.
◎Bar charts.
◎Pie charts.
Histograms: Selecting this item activates the following option. It is selected in Example 4-1.
Show normal curve on histogram.
◇Chart Values: This option is available only for bar charts and pie charts.
Frequencies. ◎Percentages.
★Format: Click “Format” button, and the dialog box of “Frequencies: Format”
pops out (Figure 4-5).
Figure 4-5 The Frequencies: Format dialog box

◇Order by.
Ascending values.
◎Descending values.
◎Ascending counts.
◎Descending counts.
◇Multiple Variables.
Compare variables.
◎Organize output by variables.
□Suppress tables with many categories.
Maximum number of categories： 10
★Bootstrap: The relevant statistics are calculated by the bootstrap method.
Click the “Bootstrap” button, and the “Bootstrap” dialog box pops out (Figure 4-6).
Figure 4-6 The Frequencies: Bootstrap dialog box
Perform bootstrapping.
Number of samples: 1000
Set seed for Mersenne Twister.
Seed: 2000000: Any positive integer from 1 to 2000000000 can be entered; the default is 2000000. If no seed is set, each run gives a different result.
◇Confidence Intervals.
Level (%) : 95 .
Percentile: The confidence interval is calculated by the percentile method.
◎Bias corrected accelerated (BCa): The confidence interval is calculated by the bias-corrected and accelerated bootstrap method.
◇Sampling: Bootstrap sampling method.
Simple: Simple sampling, in which cases are resampled from the data set as a whole (system default).
◎Stratified: Bootstrap sampling is carried out independently within each stratum defined by the variables selected in the “Strata Variables” box.

4.1.4 Reading the output

(1) Without bootstrap sampling, the main results obtained from the options in “Statistics” and “Charts” are shown in Figure 4-7 and Figure 4-8. The skewness coefficient and its standard error are -0.189 and 0.166 respectively, so Z = -0.189/0.166 = -1.1386 (P = 0.2549). The kurtosis coefficient and its standard error are -0.057 and 0.330 respectively, so Z = -0.057/0.330 = -0.1727 (P = 0.8629). Neither coefficient differs significantly from zero, so the data are considered to be normally distributed.
In Example 4-1, the two-tailed probability is calculated with the function CDF.NORMAL(Z, 0, 1) in Compute Variable.
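The Z and P values quoted above can be checked outside SPSS. The following is a minimal Python sketch, assuming only the coefficients and standard errors reported in the text; the function name `z_test_two_sided` is illustrative and is not part of SPSS.

```python
from statistics import NormalDist


def z_test_two_sided(coefficient, standard_error):
    """Return (Z, two-tailed P) for a coefficient divided by its standard error."""
    z = coefficient / standard_error
    # Two-tailed probability under the standard normal distribution.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Values reported for Example 4-1 in the text.
skew_z, skew_p = z_test_two_sided(-0.189, 0.166)
kurt_z, kurt_p = z_test_two_sided(-0.057, 0.330)

print(f"skewness: Z = {skew_z:.4f}, P = {skew_p:.4f}")
print(f"kurtosis: Z = {kurt_z:.4f}, P = {kurt_p:.4f}")
```

Both P values are well above 0.05, which is the basis for treating the data as normally distributed.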

Figure 4-7 Output of the example 4-1

Figure 4-8 Frequency distribution histogram and normal curve
(2) With bootstrap sampling, the main output is shown in Figure 4-9. Taking the mean and standard deviation as examples, under the default bootstrap options the 95% confidence intervals of the mean and the standard deviation are 14.3459~14.5337 and 0.65137~0.78666, respectively.
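The percentile method can be sketched in a few lines of Python. This is an illustration of the general technique under stated assumptions, not a reproduction of the SPSS algorithm: the data are randomly generated stand-ins for “trueap_mean” (the real file is not available here), the function name is hypothetical, and the resulting intervals will not match Figure 4-9.

```python
import random
import statistics


def percentile_bootstrap_ci(data, stat, n_samples=1000, level=0.95, seed=2000000):
    """Percentile bootstrap confidence interval for the statistic `stat`."""
    rng = random.Random(seed)  # Python's generator is also a Mersenne Twister
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_samples))
    alpha = (1 - level) / 2
    lower = estimates[int(alpha * n_samples)]
    upper = estimates[int((1 - alpha) * n_samples) - 1]
    return lower, upper


# Hypothetical data: 216 simulated diameters, NOT the values in diameter_sub.sav.
demo_rng = random.Random(1)
diameters = [demo_rng.gauss(14.4, 0.7) for _ in range(216)]

mean_ci = percentile_bootstrap_ci(diameters, statistics.mean)
sd_ci = percentile_bootstrap_ci(diameters, statistics.stdev)
print("95% CI of mean:", mean_ci)
print("95% CI of SD:  ", sd_ci)
```

Fixing the seed makes the run repeatable, which mirrors the purpose of the “Set seed for Mersenne Twister” option.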

Figure 4-9 The main output results of Bootstrap sampling

4.2 Descriptives

4.2.1 Description

The Descriptives procedure is mainly used to calculate statistics that describe central tendency and dispersion. In addition, one important function is the standardized transformation of variables, namely the Z transformation.

4.2.2 Example 4-2

Example 4-2 uses the data file “clinical [Link]”, which contains four variables: “HB1” (pre-treatment hemoglobin), “RBC1” (pre-treatment red blood cells), “WBC1” (pre-treatment white blood cells) and “PLT1” (pre-treatment platelets). Give the descriptive statistics of these four variables.

4.2.3 Running the command

Analyze
Descriptive Statistics
Descriptives
The “Descriptives” dialog box pops out (Figure 4-10). Move the variables “HB1”, “RBC1”, “WBC1” and “PLT1” into the Variable(s) box.
□Save standardized values as variables: Standardizes the analysis variables. This option produces a standardized value (Z score) for each case and stores it in the data file as a new variable whose name is the original variable name prefixed with “Z”. The Z score is calculated as $z_i = (x_i - \bar{x}) / s$, where $x_i$ is the i-th observation of the analysis variable, $\bar{x}$ is the variable mean and $s$ is the standard deviation.
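The Z transformation is easy to verify by hand. The sketch below applies the formula to a short list of made-up hemoglobin values (hypothetical, not taken from the data file) and assumes that s is the sample standard deviation with n - 1 in the denominator.

```python
import statistics


def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(x - mean) / sd for x in values]


# Hypothetical HB1 values for illustration only.
hb1 = [112.0, 98.5, 130.2, 121.7, 105.3]
zhb1 = z_scores(hb1)  # analogous to the new variable ZHB1

print([round(z, 3) for z in zhb1])
print("mean of Z:", round(statistics.mean(zhb1), 6))
print("SD of Z:  ", round(statistics.stdev(zhb1), 6))
```

After the transformation the new variable has mean 0 and standard deviation 1, whatever the original unit was.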

★Options：Click “Options” button, and the dialog box of “Descriptives: Options”

pops out (Figure 4-11).
Variable list
Alphabetic
Ascending means
Descending means
★ Bootstrap

Figure 4-10 The Descriptives dialog box

Figure 4-11 The Descriptives: Options dialog box

4.2.4 Reading the output

The output of Example 4-2 is shown in Figure 4-12. The statistics produced by the Descriptives procedure are exactly the same as the Statistics output of the Frequencies procedure. The only difference is that the Descriptives procedure offers an option to generate standardized values (Save standardized values as variables).

Figure 4-12 The output of the example 4-2

4.3 Explore

4.3.1 Description

Exploratory analysis has the following main objectives:

Check the data for outliers and/or extreme values.
Test underlying assumptions such as normality and homogeneity of variance. When the data do not satisfy these assumptions, a data transformation is suggested, and the result guides the choice between parametric and non-parametric methods.
Understand the characteristics of differences between groups.
The Explore procedure provides statistics, normality tests and descriptive plots, including stem-and-leaf plots, histograms and boxplots.

4.3.2 Example 4-3

The data file “clinical [Link]” is used as the example 4-3. The variable “PLT1”
(pre-treatment platelet) in the example 4-3 is analyzed by grouping variable as “group”.

4.3.3 Running the command

Analyze
Descriptive Statistics
Explore
The dialog box of “Explore” is listed in Figure 4-13.
Figure 4-13 The Explore dialog box

◇Dependent List: The variables to be explored, generally scale (measurement) variables; one or more can be selected. Select the variable “PLT1” in Example 4-3.
◇Factor List: The categorical (grouping) variables, typically nominal or ordinal; one or more can be selected. Select the grouping variable “GROUP” in Example 4-3.
◇Label Cases by: Only one categorical or identification variable can be selected to label the observation units. In Example 4-3, the identification variable “ID” is selected.
◇Display: Output content.
Both: “Statistics” and “Plots” are outputted (system default).
★Statistics: Click the “Statistics” button, and the “Explore: Statistics” dialog box pops out (Figure 4-14). All options are selected in this example.
Figure 4-14 The Explore: Statistics dialog box

Descriptives: Confidence Interval for Mean 95 %: The confidence interval of the population mean. The system default is 95%.
M-estimators.
Outliers: Displays five maximum and five minimum values.
Percentiles.
★Plots：Click “Plots” button, and the dialog box of “Explore: Plots” pops out
(Figure 4-15 ).

Figure 4-15 The Explore: Plots dialog box

◇Boxplots.
Factor levels together: For each category variable, only one dependent
variable is shown per graph (system default). Select this item in the example 4-3.
Dependents together: For each category variable, each graph shows all
dependent variables.
None.
◇Descriptive.
Stem-and-leaf: A stem-and-leaf plot describes the frequency distribution by replacing the groups of a frequency table with the actual values, each split into a stem and a leaf. Figure 4-16 is one of the results of Example 4-3. In this case, the stem width is 100, and each leaf represents one case. For example, there are 2 cases with a platelet count of 80×10^9/L, 1 case with 120×10^9/L, 3 cases with 130×10^9/L, and so on.
□Histogram
□Normality plots with tests: The normality test is performed and the normal probability plot is drawn. The Kolmogorov-Smirnov statistic is given, and the Shapiro-Wilk statistic is also given when the sample size is ≤ 50.

Figure 4-16 Stem and leaf plots of test group before treatment

◇Spread vs Level with Levene Test: Levene's test of homogeneity of variance, with four options.
If no test of homogeneity of variance is needed, select “None”.
To test homogeneity of variance, first select “Untransformed” to test the original data. If the data satisfy homogeneity, testing ends here.
If the data do not satisfy homogeneity, choose “Power estimation” to determine a suitable power transformation.
Finally, try to find one of the six power transformations under which the data satisfy homogeneity. If none does, nonparametric analysis should be considered.
◎None.
◎Power estimation: Estimates the best power for the transformation, which serves as a reference for choosing among the power transformations below in order to achieve homogeneity of variance.
◎Transformed Power: Power transformation. After this option is selected, the power transformation box is activated and the following methods are available.
Natural log
1/square root
Reciprocal
Square root
Square
Cube
Untransformed
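To make the idea behind the Levene test concrete, here is a minimal sketch of the classic mean-centred Levene statistic W. It is an illustration under stated assumptions: it returns only the statistic (the P value needs an F distribution, which the Python standard library does not provide), the function name is hypothetical, and SPSS reports several variants of the statistic that are not reproduced here.

```python
import statistics


def levene_w(*groups):
    """Mean-centred Levene statistic W for two or more groups."""
    # Absolute deviation of each observation from its own group mean.
    deviations = []
    for group in groups:
        center = statistics.mean(group)
        deviations.append([abs(x - center) for x in group])

    k = len(groups)
    n_total = sum(len(d) for d in deviations)
    group_means = [statistics.mean(d) for d in deviations]
    grand_mean = sum(sum(d) for d in deviations) / n_total

    # One-way ANOVA F ratio computed on the absolute deviations.
    between = sum(len(d) * (m - grand_mean) ** 2
                  for d, m in zip(deviations, group_means))
    within = sum((z - m) ** 2
                 for d, m in zip(deviations, group_means)
                 for z in d)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical groups: same spread in the first pair, very different in the second.
print(levene_w([1, 2, 3], [11, 12, 13]))
print(levene_w([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
```

A W close to zero means the groups have similar spread; a large W suggests that a power transformation or a nonparametric method is worth considering.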
★Options: Click “Options” button, and the dialog box of “Explore: Options”
pops out (Figure 4-17).

Figure 4-17 The Explore: Options dialog box

◇Missing Values: Determine how missing values are handled.

Exclude cases listwise: A case with a missing value in any variable selected for the analysis is excluded from the whole analysis (system default).
Exclude cases pairwise: A case is excluded only from the analyses that involve the variable in which its value is missing.
Report values: Cases with missing values in the factor (grouping) variables are analyzed as a separate category, and the corresponding output is reported.

4.3.4 Reading the output

(1)Description of the units involved in the analysis process: 72 cases in each group,
no missing data (Figure 4-18).

Figure 4-18 Description of the unit of observation

(2) Descriptive statistics: Some of the results are as follows (Figure 4-19):
1) 5% Trimmed Mean: The mean calculated after the largest 5% and the smallest 5% of the observations are removed.
2) Interquartile Range.
Figure 4-19 Output of descriptive statistics
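Both statistics can be sketched in a few lines. The version below is a simplification, stated as an assumption: it drops whole cases from each tail, whereas SPSS may weight partially trimmed cases, and the quartiles use Python's default interpolation method. The data are hypothetical.

```python
import statistics


def trimmed_mean(values, proportion=0.05):
    """Mean after dropping the given proportion of cases from each tail."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # whole cases dropped from each tail
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


def interquartile_range(values):
    """Interquartile range, P75 minus P25."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical data: the integers 1 to 20.
sample = list(range(1, 21))
print("5% trimmed mean:", trimmed_mean(sample))
print("IQR:", interquartile_range(sample))
```

Because the trimmed mean ignores the tails, it is less sensitive to outliers than the ordinary mean.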

(3) M-estimators: An M-estimator is a robust estimator of central tendency. Four kinds are listed: Huber's, Tukey's, Hampel's and Andrews'. In addition to the estimates, the output gives the weighting constants used by each method (Figure 4-20).

Figure 4-20 Output of M estimators

(4) Percentiles: The results of the weighted average method and Tukey's hinges (quartiles only) are given respectively (Figure 4-21).
Figure 4-21 Output of percentiles
(5) Normality test: The results of the Kolmogorov-Smirnov and Shapiro-Wilk tests are given in Figure 4-22.
1) Sig. (significance level): P value. All P values are less than or equal to 0.024, indicating that the platelet counts in both groups do not follow a normal distribution. Generally speaking, the larger the P value, the more consistent the data are with a normal distribution.
2)df (degrees of freedom).

Figure 4-22 Output of normality test

3) Normal Q-Q plot of the platelet count distribution in the trial group. If the data follow a normal distribution, the points lie close to a straight line. The plot for Example 4-3 does not support a normal distribution (Figure 4-23).
Figure 4-23 Normal Q-Q Plot of Blood platelet level before treatment for test group

(6)Homogeneity test of variance: The results of Levene's homogeneity test of

variance are given in Figure 4-24, and four algorithms for calculating Levene statistics
are listed.

Figure 4-24 The results of Levene's homogeneity test of variance

(7) Boxplot and extreme values: In the boxplot (Figure 4-25), the five lines represent five percentiles, and the height of the box is the interquartile range (= P75 - P25). Note that the boxplot is drawn after the outliers and extreme values have been set aside. The hollow dots (◦) in the figure represent outliers, that is, observations whose distance from the bottom or top edge of the box is 1.5 to 3 times the height of the box. The asterisks (*) represent extreme values, that is, observations more than 3 times the height of the box away from the bottom or top edge of the box. Figure 4-26 is the output of the Outliers option; for each variable it lists the 5 largest and 5 smallest values.
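The 1.5 and 3 box-height rules described above can be expressed as a short sketch. The function name and the data are hypothetical, and the quartiles use Python's default interpolation, which may differ slightly from the hinges SPSS uses for its boxplots.

```python
import statistics


def classify_boxplot_points(values):
    """Split values into outliers (1.5 to 3 box heights) and extremes (> 3)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    box_height = q3 - q1  # interquartile range
    outliers, extremes = [], []
    for x in values:
        # Distance beyond the nearer box edge (negative if inside the box).
        distance = max(q1 - x, x - q3)
        if distance > 3 * box_height:
            extremes.append(x)
        elif distance > 1.5 * box_height:
            outliers.append(x)
    return outliers, extremes


# Hypothetical data: 1 to 20 plus two unusually large values.
sample = list(range(1, 21)) + [40, 80]
outliers, extremes = classify_boxplot_points(sample)
print("outliers (o):", outliers)
print("extremes (*):", extremes)
```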

Figure 4-25 The Boxplot

Figure 4-26 Output of the option outlier

4.4 Crosstabs

Contingency table data are the frequency distribution of each combination of levels of two or more categorical variables, also known as cross-tabulations or crosstabs for short. The Crosstabs procedure provides a variety of tests and measures of association for two-way and multi-way contingency tables. The χ2 test is the most commonly used hypothesis test for contingency table data and is the focus of this section.

4.4.1 χ2 Test for comparison of two Independent sample rates

4.4.1.1 Description
Data for comparing two independent sample rates come in two formats: frequency table format, as in Example 4-4, and raw record format, as in Example 4-5.
4.4.1.2 Example 4-4
To compare the efficacy of ultraviolet and antiviral drugs in the treatment of herpes zoster, patients with herpes zoster are randomly divided into two groups. See Table 4-1 for the clinical observations. Is there any difference in overall effectiveness between the two groups?
Table 4-1 Comparison of the efficacy of ultraviolet and antiviral drugs in the treatment of herpes zoster

Group              Valid  Invalid  Total  Effective rate (%)
Antiviral group       31       25     56               55.36
Ultraviolet group     55        9     64               85.94
Total                 86       34    120               71.67

4.4.1.3 SPSS data format

Example 4-4 uses the data file “chi2_2.sav”, which has 4 rows and 3 columns. The three variables are the row variable, the column variable and the frequency variable (Figure 4-27).
Categorical variable (row variable): named “group”; “1” stands for the antiviral group and “2” for the ultraviolet group.
Categorical variable (column variable): named “effect”; “1” stands for valid and “2” for invalid.
Frequency variable: named “freq”; enter the four frequencies of the fourfold table.

Figure 4-27 Data file format of χ2 test

4.4.1.4 Running the command

Select from the menu
Data
Weight Cases
The “Weight Cases” dialog box pops out. Select “Weight cases by” and move “freq” into the box to specify it as the frequency variable.
Select from the menu
Analyze
Descriptive Statistics
Crosstabs
The dialog box of “Crosstabs” is listed in Figure 4-28.

Figure 4-28 The Crosstabs dialog box

◇Row(s): This example is selected as “group”.

◇Column(s): This example is selected as “effect”.
◇Layer: A control variable, which determines the layers of the frequency table. To add another control variable, click “Next” and then select a variable. Click “Previous” to return to the previously selected variable.
□Display layer variables in table layers: Controls whether the layer variables are displayed as layers in the frequency table. It does not affect the output of the related statistics.
□Display clustered bar charts.
□Suppress tables.
★Exact: The exact probability test.
★Statistics: Click the “Statistics” button, and dialog box of “Crosstabs: Statistics”
pops out (Figure 4-29).

Figure 4-29 The Crosstabs: Statistics dialog box

□Chi-square: For fourfold (2×2) table data, the results of the Pearson χ2 test, the likelihood ratio χ2 test, the Yates continuity-corrected χ2 test and Fisher's exact test are output.
□Correlations: Pearson and Spearman correlation coefficients are calculated to
show the correlation between row variables and column variables.
◇Nominal: Association measurement of two categorical variables.

□Contingency coefficient: $C = \sqrt{\chi^2 / (\chi^2 + N)}$, where N is the total number of cases. C lies between 0 and 1; the greater the value, the stronger the association between the variables.
□Phi and Cramer's V: The φ coefficient and Cramer's V. $\varphi = \sqrt{\chi^2 / N}$; $V = \sqrt{\chi^2 / (N(k-1))}$, where k is the smaller of the number of rows and the number of columns. For fourfold table data, φ = V. Both values lie between 0 and 1; the greater the value, the stronger the association.
□Lambda: A proportional-reduction-in-error measure. It takes a value between 0 and 1, where 1 indicates the best prediction and 0 the worst.
□Uncertainty coefficient: Also a proportional-reduction-in-error measure, with the same meaning as Lambda. Two kinds of results are given: symmetric and asymmetric.
◇Ordinal:Correlation degree measurement of two ordered classification variables
(rank variables).
□Gamma: Statistics for measuring the correlation between two rank variables.
γ=(P-Q)/(P+Q). Here P is Concordant pairs, and Q is Discordant pairs. γ ranges
between -1 and +1, where +1 means perfect positive relationship and -1 means perfect
negative relationship. If γ equals 0, it means no correlation at all.
□Somers' d: An asymmetric extension of Gamma that differs from it only in that tied pairs are added to the denominator. The range and interpretation of the values are the same as for Gamma.

□Kendall's tau-b: $\tau_b = (P - Q) / \sqrt{(P + Q + T_X)(P + Q + T_Y)}$, where $T_X$ is the number of pairs tied only on the first variable and $T_Y$ is the number of pairs tied only on the second variable.

□Kendall's tau-c: $\tau_c = 2m(P - Q) / (N^2(m - 1))$, where m is the smaller of the number of rows and the number of columns, and N is the total sample size.
◇Nominal by Interval: The correlation between a qualitative variable and a
quantitative variable.
□Eta: Correlation statistics.
□Kappa: The κ coefficient is a measure of agreement, used to assess the degree of agreement between two observers or two measuring devices. κ ranges from -1 to +1; the closer κ is to +1, the higher the degree of agreement.
□Risk: Risk analysis is suitable only for fourfold table data; it gives the relative risk (RR) and the odds ratio (OR).
□McNemar: χ2 test for paired count data.
□Cochran's and Mantel-Haenszel statistics: The Mantel-Haenszel common OR test is used to test whether two binary variables are independent conditional on covariates (layer variables), that is, after removing the influence of the covariates. After this item is selected, “Test common odds ratio equals: 1” is activated. The value 1 is the system default; the test examines whether the common OR differs significantly from 1.
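The nominal measures and Gamma described in the list above are simple functions of χ2, N and the pair counts. The sketch below is illustrative only: the function names are hypothetical, the χ2 value is the one reported for Example 4-4 later in this section (χ2 = 13.755, N = 120), and the concordant and discordant pair counts are made up.

```python
import math


def nominal_measures(chi_square, n, n_rows, n_cols):
    """Return (contingency coefficient C, phi, Cramer's V)."""
    k = min(n_rows, n_cols)
    c = math.sqrt(chi_square / (chi_square + n))
    phi = math.sqrt(chi_square / n)
    v = math.sqrt(chi_square / (n * (k - 1)))
    return c, phi, v


def goodman_kruskal_gamma(concordant, discordant):
    """Gamma = (P - Q) / (P + Q) for two ordinal variables."""
    return (concordant - discordant) / (concordant + discordant)


# Fourfold table of Example 4-4: chi-square = 13.755, N = 120.
c, phi, v = nominal_measures(13.755, 120, n_rows=2, n_cols=2)
print(f"C = {c:.4f}, phi = {phi:.4f}, V = {v:.4f}")

# Hypothetical pair counts, for illustration only.
print("gamma =", goodman_kruskal_gamma(concordant=30, discordant=10))
```

For a fourfold table k - 1 = 1, so φ and V coincide, as the text states.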
★Cells: Contents displayed in the cells of the table.
Click the “Cells” button, and dialog box of “Crosstabs: Cell Display” pops out
(Figure 4-30).

Figure 4-30 The Crosstabs :Cell Display dialog box

◇Counts
□Observed
□Expected
□Hide small counts
Less than 5: Counts smaller than n are not displayed; by system default, counts of fewer than 5 cases are hidden.
◇Z-test: Z test based on the normal distribution.
□Compare column proportions: The column variable is used as the grouping variable, and the column proportions in each row are compared pairwise.
□Adjusted P-values (Bonferroni method): If the column variable has three or more categories, correction of the P values should be considered. The method provided here is the Bonferroni method.
◇Percentages.
□Row.
□Column.
□Total.
◇Residuals.
□Unstandardized.
□Standardized.
□Adjusted Standardized.
◇Noninteger Weights: Processing of non-integer frequency variables.
Round cell counts: Case weights are used as they are, but the accumulated frequency in each cell is rounded before the statistics are calculated.
◎Round case weights: Case weights are rounded before use.
◎Truncate cell counts: Case weights are used as they are, but the accumulated frequency in each cell is truncated before the statistics are calculated.
◎Truncate case weights: Case weights are truncated before use.
◎No adjustments.
★Format：Click the “Format” button, and the dialog box of “Crosstabs: Table
Format” pops out (Figure 4-31).
Figure 4-31 The Crosstabs: Table Format dialog box
◇Row order.
Ascending: System default.
◎Descending.
★Bootstrap
4.4.1.5 Reading the output
The whole process of this example is as follows:
Data
Weight Cases
Weight Cases by: freq
Analyze
Descriptive Statistics
Crosstabs
Row(s): group
Column(s): effect
Statistics
Chi-square
Cells
Observed
Row
(1)Frequency distribution table (Figure 4-32).
(2)Test results (Figure 4-33).
1) Pearson Chi-Square: Uncorrected χ2 test, suitable for R × C table data.
2) Continuity Correction: The continuity-corrected χ2 test, used only for fourfold table data.
3) Likelihood Ratio: The likelihood ratio χ2 test, suitable for R × C table data.
4) Fisher's Exact Test: Output by default only for fourfold table data.
5) Linear-by-Linear Association: A linear trend test, used to analyze whether a categorical variable is linearly associated with an ordinal variable; it can be ignored in other cases.

Figure 4-32 Frequency distribution table

Figure 4-33 Chi-Square test results

6) N of Valid Cases.

7) The theoretical (expected) frequency of every cell is greater than 5; the minimum expected frequency is 15.87.
8) χ2 = 13.755, ν = 1, P < 0.001 (two-tailed test). The difference is statistically significant, and it can be concluded that ultraviolet treatment of herpes zoster is superior to antiviral drugs.
9) Exchanging the row and column variables does not change the result of the χ2 test. This holds for the χ2 test in all contingency tables.
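The Pearson χ2 value and the minimum expected frequency reported above can be reproduced from Table 4-1. The following is a minimal sketch, not the SPSS implementation; the function name is hypothetical, and the P value uses a closed form that is valid only for one degree of freedom.

```python
import math


def pearson_chi_square(table):
    """Return (chi-square, expected frequencies) for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected frequency = row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    return chi2, expected


# Table 4-1: rows = antiviral, ultraviolet; columns = valid, invalid.
table_4_1 = [[31, 25], [55, 9]]
chi2, expected = pearson_chi_square(table_4_1)
min_expected = min(min(row) for row in expected)

# For df = 1 only, the upper-tail probability equals erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi-square = {chi2:.3f}, minimum expected = {min_expected:.2f}")
print(f"P = {p_value:.6f}")
```

Transposing the table (swapping rows and columns) gives the same χ2, which illustrates the last point in the list above.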
4.4.1.6 Example 4-5
Example 4-5 uses the data file “clinical [Link]”. Compare the gender distribution between the two groups of the categorical variable “group”. Because the data are in raw record format, no frequency variable needs to be specified. The whole process is as follows:
Analyze
Descriptive Statistics
Crosstabs
Row(s): group
Column(s): gender
Statistics
Chi-square
Cells
Observed
Row
The results are shown in Figures 4-34 and 4-35. There is no statistically significant
difference in gender distribution between the two groups (χ2 = 0.273, P = 0.601).

Figure 4-34 Gender distribution in both groups

Figure 4-35 Chi-Square test results

4.4.2 χ2 Test of R × C Table data

[Link] Description
It is mainly used for the comparison of multiple sample rates and two or more
sample composition ratios.
[Link] Example 4-6
The etiological results of 504 pediatric patients in a hospital are shown in Table 4-2.
Try to analyze whether the etiological positive rate is related to age.
Table 4-2 Results of etiological detection in different age groups

Age group (years)   Negative   Positive   Total   Positive rate (%)
<1                  30         14         44      31.8
1~                  50         60         110     54.5
3~                  88         107        195     54.9
6~13                69         86         155     55.5
Total               237        267        504     53.0

[Link] SPSS data format

Create a data file “chiR_C.sav”. It contains three variables: a row variable, a
column variable and a frequency variable (Figure 4-36).
Categorical variable (row variable): The row variable is named “age_g”, where “1”
stands for less than 1 year, “2” stands for 1~3 years, “3” stands for 3~6 years, and “4”
stands for 6~13 years.
Categorical variable (column variable): The column variable is named “aetiology”,
where “0” stands for a negative result and “1” stands for a positive result.
Frequency variable: The variable is named “freq”; enter the 8 frequencies in Table 4-2.

Figure 4-36 The data file format

[Link] Running the command
Data
Weight Cases
Weight Cases by: freq
Analyze
Descriptive Statistics
Crosstabs
Row(s): aetiology
Column(s): age_g
Statistics
Chi-square
Cells
Observed
Row
Column
Compare column proportions
Adjusted p-values (Bonferroni method)
[Link] Reading the output
The frequency distribution table is shown in Figure 4-37, and the test results are
shown in Figure 4-38. The etiological positive rate is related to age (χ2 = 8.688,
ν = 3, P = 0.034, two-tailed test). The linear trend test is also statistically
significant (χ2 = 3.956, ν = 1, P = 0.047, two-tailed test). The positive rate tends to
increase with age, but the increase occurs mainly between the “<1 year” group and
the “1~3 years” group; the rate changes little after one year of age. The expected
frequency of every cell is greater than 5, and the minimum expected frequency is
20.69. Multiple comparisons of the positive rate between age groups are shown in
Figure 4-37. They show that the difference lies with children under 1 year old, whose
positive rate (31.8%) is the lowest.
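
The statistics quoted above can be checked by hand. The following is a minimal Python sketch, not SPSS output: it recomputes the Pearson χ2, the minimum expected frequency and the linear-by-linear χ2 from the counts in Table 4-2. The helper `chi2_sf` and the integer scores 1 to 4 for the age groups and 0/1 for the result are assumptions made for this illustration.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution (integer df)."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite sum over half-integer gamma values
    total = math.erfc(math.sqrt(half))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (k - 0.5) / math.gamma(k + 0.5)
    return total


# Counts from Table 4-2: (negative, positive) for the groups <1, 1~, 3~, 6~13
observed = [(30, 14), (50, 60), (88, 107), (69, 86)]
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency of a cell = row total * column total / n
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
pearson = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)
p_pearson = chi2_sf(pearson, df)
min_expected = min(min(row) for row in expected)

# Linear-by-linear association: (n - 1) * r^2, where r is the Pearson
# correlation between the row scores and the column scores (assumed scores)
age_scores = [1, 2, 3, 4]
result_scores = [0, 1]
sum_x = sum(s * rt for s, rt in zip(age_scores, row_totals))
sum_xx = sum(s * s * rt for s, rt in zip(age_scores, row_totals))
sum_y = sum(s * ct for s, ct in zip(result_scores, col_totals))
sum_yy = sum(s * s * ct for s, ct in zip(result_scores, col_totals))
sum_xy = sum(
    age_scores[i] * result_scores[j] * observed[i][j]
    for i in range(len(age_scores))
    for j in range(len(result_scores))
)
s_xy = sum_xy - sum_x * sum_y / n
s_xx = sum_xx - sum_x ** 2 / n
s_yy = sum_yy - sum_y ** 2 / n
trend = (n - 1) * s_xy ** 2 / (s_xx * s_yy)
p_trend = chi2_sf(trend, 1)

print(f"Pearson chi-square = {pearson:.3f}, df = {df}, P = {p_pearson:.3f}")
print(f"Minimum expected frequency = {min_expected:.2f}")
print(f"Linear-by-linear = {trend:.3f}, df = 1, P = {p_trend:.3f}")
```

Under these assumptions the sketch gives the same rounded values as those reported for this example (8.688, 20.69 and 3.956).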

Figure 4-37 Frequency distribution table

Figure 4-38 Chi-Square Test results

4.4.3 χ2 test and κ coefficient test for paired counting data

[Link] Example 4-7

A total of 65 patients with respiratory tract infection are treated with an antibiotic.
The results of bacteriological examination before and after treatment are shown in Table
4-3. Try to analyze whether the antibiotic is effective in the treatment of respiratory
tract infections.
Table 4-3 Observation results of antibiotics in the treatment of respiratory tract
infections

Bacteriological examination   Bacteriological examination   Total
before treatment              after treatment
                              -            +
-                             20           2               22
+                             29           14              43
Total                         49           16              65

[Link] SPSS data format

The data file “chi_pair.sav” is used for Example 4-7. The file has four rows and
three columns. The three variables are a row variable, a column variable and a
frequency variable.
Categorical variable (row variable): The row variable is named “treat_b”, where “0”
stands for negative and “1” stands for positive.
Categorical variable (column variable): The column variable is named “treat_a”,
where “0” stands for negative and “1” stands for positive.
Frequency variable: The variable is named “freq”; enter the 4 frequencies in Table 4-3.
[Link] Running the command
Data
Weight Cases
Weight Cases by: freq
Analyze
Descriptive Statistics
Crosstabs
Row(s): treat_b
Column(s): treat_a
Statistics
McNemar
□ Descriptive:
[Link] Reading the output
The frequency distribution is shown in Figure 4-39, and the test results are shown
in Figure 4-40. The test used is the McNemar test based on the binomial distribution.
The difference is statistically significant (P<0.001), indicating that the antibiotic is
effective in the treatment of respiratory tract infections.
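
The binomial form of the McNemar test uses only the two discordant cells of Table 4-3 (2 patients changed from negative to positive, 29 from positive to negative). The following is a minimal Python sketch of that calculation; the function name `mcnemar_exact` is an assumption made for this illustration and is not part of SPSS.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact (binomial) McNemar P value from the discordant counts."""
    n = b + c
    k = min(b, c)
    # Under H0 each discordant pair falls in either cell with probability 0.5
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Discordant cells of Table 4-3: before - / after + = 2, before + / after - = 29
p_value = mcnemar_exact(2, 29)
print(f"Exact McNemar P = {p_value:.2e}")
```

The concordant cells (20 and 14) do not enter the test, which is why only the patients whose result changed carry information about the treatment effect.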

Figure 4-39 Frequency distribution table

Figure 4-40 Chi-Square Test results

[Link] Example 4-8

The data file “[Link]” is used for Example 4-8. The diagnostic results of
116 patients are shown in Table 4-4. Use the κ coefficient to analyze the agreement
between CT diagnosis and pathological diagnosis.
Table 4-4 Diagnostic results of two examination methods for patients

CT examination     Pathological examination     Total
                   inflammation    tumor
inflammation       35              11           46
tumor              3               67           70
Total              38              78           116

[Link] Running the command

Open the data file “[Link]”. Because it is the raw data, there is no need to
define the frequency variables. The process is as follows:
Analyze
Descriptive Statistics
Crosstabs
Row(s): diag_CT
Column(s): diag_path
Statistics
McNemar
Kappa
[Link] Reading the output
McNemar test (Figure 4-41) showed that there is no significant difference in
diagnostic results between the two methods.

Figure 4-41 McNemar Test result

The agreement coefficient of the two diagnostic methods is κ = 0.740 (P < 0.001),
which indicates that the agreement between the two diagnostic methods is
statistically significant and strong. Generally speaking, κ ≥ 0.7 indicates strong
agreement, 0.4 ≤ κ < 0.7 moderate agreement, and κ < 0.4 weak agreement.
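
The coefficient can be reproduced from the counts in Table 4-4. The following is a minimal Python sketch of Cohen's κ (observed agreement corrected for the agreement expected by chance); the function name `cohen_kappa` is an assumption made for this illustration.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square agreement table (rows: rater 1, columns: rater 2)."""
    n = sum(sum(row) for row in table)
    # Observed agreement: proportion of cases on the main diagonal
    observed = sum(table[i][i] for i in range(len(table))) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Agreement expected by chance from the marginal totals
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)


# Table 4-4: rows = CT diagnosis, columns = pathological diagnosis
ct_vs_pathology = [[35, 11], [3, 67]]
kappa = cohen_kappa(ct_vs_pathology)
print(f"kappa = {kappa:.3f}")
```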

Figure 4-42 The result of measure of agreement

4.4.4 χ2 Test for stratified data

[Link] Example 4-9

Doll and Hill studied the relationship between smoking and lung cancer in 709
patients with lung cancer and 709 non-tumor patients according to sex. The results are
as follows (Table 4-5). Try to do a case-control analysis of lung cancer.
Table 4-5 Sex and smoking history associated with lung cancer

Smoking history   Male                        Female
                  case   control   total      case   control   total
Smoke             647    622       1269       41     28        69
Not smoke         2      27        29         19     32        51
Total             649    649       1298       60     60        120

[Link] SPSS data format

Create a data file “[Link]”, which has eight rows and four columns. The four
variables are a row variable, a column variable, a stratification variable and a
frequency variable.
The row variable is named “smoke” and records smoking status, where “1” means
yes and “2” means no. The exposure factor is usually chosen as the row variable,
especially in prospective studies.
The column variable is named “case_ctr”, where “1” stands for the case group and
“2” stands for the control group. The outcome factor, such as illness or not, is usually
chosen as the column variable, especially in prospective studies.
The stratification variable is named “gender”, where “1” stands for “male” and “2”
stands for “female”.
Frequency variable: The variable is named “freq”; enter the 8 basic frequencies in
the above table. The data format is shown in Figure 4-43.

Figure 4-43 Data format of example 4-9

[Link] Running the command
Select from the menu:
Data
Weight Cases
Weight Cases by: freq (Defining frequency variable)

Analyze
Descriptive Statistics
Crosstabs
Row(s): smoke (Row variable)

Column(s): case_ctr (Column variable)

Layer: gender (Stratified variable)

Statistics
Chi-square
Risk
Cochran's and Mantel-Haenszel statistics
Test common odds ratio equals: 1
Cells
Column
Total
[Link] Reading the output
(1)Stratified χ2 test: The results are shown in Figures 4-44 and 4-45. The
association between smoking and lung cancer is tested separately for each gender. The
results show that smoking is significantly associated with lung cancer in both strata
(P≤ 0.016). The smoking rate in the case group is significantly higher than that in the
control group, suggesting that smoking might be a risk factor for lung cancer.
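
The per-stratum test can be sketched in Python from the counts in Table 4-5. This is an illustration under the assumption that the reported P values refer to the uncorrected Pearson χ2 of each 2 × 2 table; the function name `chi2_2x2` is hypothetical.

```python
import math


def chi2_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square and P value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper tail equals erfc(sqrt(stat / 2))
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p


# Table 4-5, per gender: (smoker case, smoker control,
#                         non-smoker case, non-smoker control)
strata = {"male": (647, 622, 2, 27), "female": (41, 28, 19, 32)}
results = {sex: chi2_2x2(*cells) for sex, cells in strata.items()}
for sex, (stat, p) in results.items():
    print(f"{sex}: chi-square = {stat:.3f}, P = {p:.4f}")
```

In this sketch the larger of the two P values comes from the female stratum.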

Figure 4-44 Frequency distribution table

Figure 4-45 The result of stratified chi-square test

(2)Stratified risk estimates: The results are shown in Figure 4-46 and are
explained below:
1) Odds Ratio for: The OR value and its confidence interval (Figure 4-46). For
example, with p1 the smoking rate in the case group and p0 the smoking rate in the
control group, the OR value of the male group is:

$$
\mathrm{OR} = \frac{p_1 / (1 - p_1)}{p_0 / (1 - p_0)}
            = \frac{0.997 / 0.003}{0.958 / 0.042} = 14.043
$$

(The value 14.043 is obtained from the unrounded rates 647/649 and 622/649.)
The 95% confidence interval of the OR value is 3.325~59.301, which does not
include 1, so the OR is statistically different from 1. The result indicates that smoking
is a risk factor for lung cancer in men. The OR value for women is 2.466, and the 95%
confidence interval is 1.172~5.188, which does not include 1. The result indicates that
smoking is a risk factor for lung cancer in women. The OR value for men is
considerably larger than that for women; whether this difference is statistically
significant requires a further test.
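
The OR values and their intervals can be checked from the counts in Table 4-5. The following is a minimal Python sketch that assumes the confidence interval is the usual large-sample interval on the log scale (standard error = square root of the sum of the reciprocal cell counts); the function name `odds_ratio_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio of [[a, b], [c, d]] with a log-scale (Woolf) confidence interval."""
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Table 4-5: (smoker case, smoker control, non-smoker case, non-smoker control)
male = odds_ratio_ci(647, 622, 2, 27)
female = odds_ratio_ci(41, 28, 19, 32)
print("male:   OR = %.3f, 95%% CI %.3f~%.3f" % male)
print("female: OR = %.3f, 95%% CI %.3f~%.3f" % female)
```

The very wide interval for men reflects the small count of male non-smoking cases (2), which dominates the standard error.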
2) For cohort: The relative risk (RR) for a cohort study (prospective study) is
reported here. This example is a case-control study (retrospective study), so this result
is meaningless. For illustration only, suppose the data came from a cohort study in
which the subjects are divided into a smoking group and a non-smoking group, and the
observed outcome is whether lung cancer occurs. The incidence in the male smoking
group would be 647/1269=0.5099, and that in the male non-smoking group
2/29=0.0690. The ratio of the two, the relative risk of lung cancer for men who
smoke, is RR=0.5099/0.0690=7.393.
It should be pointed out that in a case-control study, the OR value obtained in the
Crosstabs main dialog box is correct whether the case-control variable (case_ctr) is
selected as the row variable or the column variable. For a cohort study, however, the
exposure factor must be selected as the row variable and the outcome variable as the
column variable; otherwise the reported “For cohort” value is wrong.
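
The point about row and column order can be illustrated numerically with the male stratum of Table 4-5. The following is a minimal Python sketch; the function names are assumptions made for this illustration, and the data are treated as a cohort only hypothetically, as in the text.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio of the table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def risk_ratio(a, b, c, d):
    """Ratio of the row-wise proportions a/(a+b) and c/(c+d)."""
    return (a / (a + b)) / (c / (c + d))


# Exposure in rows: [smoker: case, control], [non-smoker: case, control]
a, b, c, d = 647, 622, 2, 27

or_exposure_rows = odds_ratio(a, b, c, d)
rr_exposure_rows = risk_ratio(a, b, c, d)

# Transposed table: rows are case/control, columns are smoker/non-smoker
or_transposed = odds_ratio(a, c, b, d)
rr_transposed = risk_ratio(a, c, b, d)

print(f"OR: {or_exposure_rows:.3f} vs {or_transposed:.3f} after transposing")
print(f"RR: {rr_exposure_rows:.3f} vs {rr_transposed:.3f} after transposing")
assert math.isclose(or_exposure_rows, or_transposed)
```

The odds ratio is unchanged by the transposition, whereas the ratio of row-wise proportions changes, so it is only a relative risk when the exposure is the row variable.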
(3) Homogeneity test of the OR values of the two genders: The results are shown in
Figure 4-47. Both the Breslow-Day and the Tarone tests show that the OR values
differ significantly between the genders (P<0.030), with the OR for men higher than
that for women.
Figure 4-46 Stratified risk estimation

Figure 4-47 Consistency Test of OR value of the example 4-9

(4) Covariate analysis: Figure 4-48 shows the test results of the Mantel-Haenszel
method (MH method) and the Cochran-modified MH method (CMH method). Both
methods test the relationship between smoking and lung cancer with gender as a
covariate, that is, after removing the influence of gender. The results show that
smoking is still significantly associated with lung cancer after adjusting for gender,
further suggesting that smoking is a risk factor for lung cancer.
Figure 4-48 Results of MH test and CMH test

(5) Mantel-Haenszel common odds ratio (common OR) estimate: The result is
shown in Figure 4-49 and is explained as follows:
1) With gender as the stratification variable, the common OR value is 4.524, which
is statistically different from 1 (P<0.001); the 95% confidence interval of the common
OR value is 2.417~8.467, which does not include 1.
2) In the table, “ln” denotes the natural logarithm of the estimate, for example
ln (4.524) =1.509; ln (2.417) =0.883; ln (8.467) =2.136.
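
The common OR can be reproduced from Table 4-5 with the Mantel-Haenszel estimator, which weights each stratum by its size. The following is a minimal Python sketch of the point estimate only (the confidence interval is not reproduced here); the function name `mantel_haenszel_or` is an assumption made for this illustration.

```python
import math


def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio over 2 x 2 strata given as (a, b, c, d)."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator


# Table 4-5: (smoker case, smoker control, non-smoker case, non-smoker control)
strata = [
    (647, 622, 2, 27),  # male
    (41, 28, 19, 32),   # female
]
common_or = mantel_haenszel_or(strata)
print(f"MH common OR = {common_or:.3f}, ln = {math.log(common_or):.3f}")
```

The estimate lies between the two stratum-specific OR values, closer to the female value because the male stratum contributes little to the denominator.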

Figure 4-49 Results of estimation of MH common OR value

4.5 Ratio

4.5.1 Description

The Ratio procedure analyzes the ratio of two quantitative variables, reports
various statistics of the ratio, and can also save the results to a data file.
4.5.2 Example 4-10

The data file “clinical [Link]” is used for Example 4-10. With “GROUP” as the
grouping variable, obtain various statistics of the ratio of the hemoglobin content
“HB1” to the post-treatment hemoglobin content “HB2”.

4.5.3 Running the command

Analyze
Descriptive Statistics
Ratio
The “Ratio Statistics” dialog box opens (Figure 4-50).
Select “HB1” into the “Numerator” box, “HB2” into the “Denominator” box
and “GROUP” into the “Group Variable” box. Once a grouping variable has been
selected, the “Sort by group variable” option is activated, and the output can be
ordered by the levels of the grouping variable in ascending or descending order.

Figure 4-50 The Ratio Statistics dialog box

Display results.
□Save results to external file.
★Statistics: Click the “Statistics” button to open the “Ratio Statistics: Statistics”
dialog box (Figure 4-51).

Figure 4-51 The Ratio Statistics: Statistics dialog box

◇Central Tendency.
□Median.
□Mean.
□Weighted Mean: The mean of the numerator divided by the mean of the
denominator, which equals the mean of the ratios weighted by the denominator.
□Confidence intervals:
◇Dispersion.
AAD (average absolute deviation): The sum of the absolute deviations of the
ratios from their median, divided by the sample size.
COD (coefficient of dispersion): AAD divided by the median of the ratios.
PRD (price-related differential): Also known as the index of regressivity, that is,
the mean of the ratios divided by the weighted mean.
Median Centered COV (median-centered coefficient of variation): The root mean
square of the deviations of the ratios from their median, divided by the median and
expressed as a percentage.
Mean Centered COV (mean-centered coefficient of variation): The commonly
used coefficient of variation, that is, the standard deviation of the ratios divided by
their mean.
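
The definitions above can be written out as a short Python sketch. It is only an illustration: the numerator and denominator values are made up (they are not taken from the example data file), the function name `ratio_statistics` is hypothetical, and the n − 1 divisor in the median-centered COV is an assumption, since the text does not state the divisor.

```python
import math
import statistics


def ratio_statistics(numerators, denominators):
    """Central tendency and dispersion statistics of the ratios numerator/denominator."""
    ratios = [x / y for x, y in zip(numerators, denominators)]
    n = len(ratios)
    median = statistics.median(ratios)
    mean = statistics.fmean(ratios)
    # Mean of the numerator over mean of the denominator
    weighted_mean = sum(numerators) / sum(denominators)
    aad = sum(abs(r - median) for r in ratios) / n
    # Assumption: n - 1 divisor for the median-centered root mean square
    rms_median = math.sqrt(sum((r - median) ** 2 for r in ratios) / (n - 1))
    return {
        "median": median,
        "mean": mean,
        "weighted_mean": weighted_mean,
        "aad": aad,
        "cod": aad / median,
        "prd": mean / weighted_mean,
        "cov_median_pct": 100 * rms_median / median,
        "cov_mean_pct": 100 * statistics.stdev(ratios) / mean,
    }


# Made-up values for illustration only
hb1 = [10, 12, 9, 15]
hb2 = [8, 10, 9, 12]
stats = ratio_statistics(hb1, hb2)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Multiplying `cod` by 100 expresses it as a percentage of the median, in the same way as the two COV values.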

Figure 4-52 Output of the example 4-10

4.6 P-P Plots/Q-Q Plots

4.6.1 Description

Both P-P Plots and Q-Q Plots are used to check whether the distribution of a
sample follows a given theoretical distribution. A P-P Plot compares the actual
cumulative probabilities with the theoretical cumulative probabilities; if the data
follow the distribution, the differences between the two are scattered symmetrically
around zero. A Q-Q Plot checks whether the actual quantiles match the theoretical
quantiles. If they agree, the points lie close to a straight line, or, in the detrended
plot, the differences between the actual and the theoretical quantiles are distributed
in a band symmetrical about the horizontal line at 0.
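
How the points of the two plots are obtained can be sketched in Python for a normal test distribution. This is an illustration under stated assumptions, not the SPSS algorithm: the sample values are made up, the mean and standard deviation are estimated from the data, Blom's proportion formula is used, ties are not handled, and the sign convention of the detrended deviations (observed minus expected) is an assumption.

```python
from statistics import NormalDist, fmean, stdev


def normal_pp_qq(values):
    """Return P-P points, Q-Q points and detrended Q-Q deviations for a normal fit."""
    data = sorted(values)
    n = len(data)
    dist = NormalDist(fmean(data), stdev(data))
    # Blom's proportion estimate for rank r = 1..n
    props = [(r - 3 / 8) / (n + 1 / 4) for r in range(1, n + 1)]
    # P-P: observed cumulative proportion against expected cumulative probability
    pp = [(p, dist.cdf(x)) for p, x in zip(props, data)]
    # Q-Q: observed value against theoretical quantile
    qq = [(x, dist.inv_cdf(p)) for p, x in zip(props, data)]
    # Detrended Q-Q: deviation of each observed value from its theoretical quantile
    deviations = [x - q for x, q in qq]
    return pp, qq, deviations


# Made-up sample for illustration only
sample = [13.5, 14.0, 14.2, 14.4, 14.6, 14.8, 15.3]
pp, qq, deviations = normal_pp_qq(sample)
for (obs_p, exp_p), (obs_x, exp_x) in zip(pp, qq):
    print(f"P-P: {obs_p:.3f} vs {exp_p:.3f}   Q-Q: {obs_x:.2f} vs {exp_x:.2f}")
```

If the sample is close to normal, both sets of points fall near the diagonal and the deviations scatter around 0 without a clear pattern.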

4.6.2 Example 4-11

The data file “diameter_sub.sav” is used as the example 4-11. Please check the
normality of the variable “trueap_mean”.

4.6.3 Running the command

Select from the menu:

Analyze
Descriptive Statistics
P-P/Q-Q
The “P-P Plots/Q-Q Plots” dialog box opens (Figure 4-53).

Figure 4-53 The P-P Plots/Q-Q Plots dialog box

◇Variables.
◇Test Distribution: There are 13 distributions to choose from. If the selected
distribution involves degrees of freedom, the df (degrees of freedom) box below is
activated and must be filled in.
◇Distribution Parameters: The system default is Estimate from data. If this is not
selected, the Location and Scale boxes are activated and the corresponding parameters
must be filled in.
The available test distributions are Beta, Chi-square, Exponential, Gamma,
Half-normal, Laplace, Logistic, Lognormal, Normal, Pareto, Student t, Weibull and
Uniform.
◇Transform.
□Natural log transform.
□Standardize values.
□Difference: Non-seasonal difference transformation, that is, the original data are
replaced by the differences between consecutive values. Enter a positive integer to
set the order of differencing. If you enter 2, the first two values of the new variable
are system-missing.
□Seasonally difference: Seasonal difference transformation is used to calculate
the data difference between two constant intervals in time series, and the data interval
depends on the period.
Current Periodicity: None: The current time period. The system defaults to none.
◇Proportion Estimate Formula: Here are four formulas to choose from.
Blom's: (r-3/8)/(n+1/4). Here n is the sample size and r is the rank, from 1 to n;
the same notation is used below.
◎Rankit: (r-1/2)/n.
◎Tukey's: (r-1/3)/(n+1/3).
◎Van der Waerden's: r/(n+1).
◇Rank Assigned to Ties.
Mean.
◎High.
◎Low.
◎Break ties arbitrarily.
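
The four proportion estimate formulas listed above differ mainly at the extreme ranks. The following is a minimal Python sketch that evaluates them for the smallest rank of a sample of size 10 and converts each proportion to a normal score; the dictionary name `FORMULAS` and the sample size are assumptions made for this illustration.

```python
from statistics import NormalDist

# Proportion estimate for rank r (1..n) in a sample of size n
FORMULAS = {
    "Blom": lambda r, n: (r - 3 / 8) / (n + 1 / 4),
    "Rankit": lambda r, n: (r - 1 / 2) / n,
    "Tukey": lambda r, n: (r - 1 / 3) / (n + 1 / 3),
    "Van der Waerden": lambda r, n: r / (n + 1),
}

n = 10
standard_normal = NormalDist()
for name, formula in FORMULAS.items():
    p_first = formula(1, n)
    # Expected normal score (theoretical quantile) of the smallest observation
    z_first = standard_normal.inv_cdf(p_first)
    print(f"{name}: p = {p_first:.4f}, normal score = {z_first:.3f}")
```

All four formulas are symmetric, so the proportions for ranks r and n + 1 − r add up to 1, and they converge as n grows.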
Process:
Analyze
Descriptive Statistics
P-P/Q-Q
Variables: trueap_mean
Test Distribution: Normal
Estimate from data
The results are shown in Figures 4-54~58. The estimated parameters of the normal
distribution used for the plots are a location parameter of 14.4421 and a scale
parameter of 0.71728 (Figure 4-54).

Figure 4-54 P-P normal probability cumulative probability parameter

Taking the plots together, the data can be judged visually to follow a normal
distribution. A rigorous statistical conclusion, however, requires a quantitative test;
see the section on normality testing for details.

Figure 4-55 Normal P-P Plot

Figure 4-56 Detrended Normal P-P Plot

Figure 4-57 Normal Q-Q Plot

Figure 4-58 Detrended Normal Q-Q Plot

Ying Guan (Southern Medical University)
