
Essential Data Analysis Concepts Explained

Data 101 notes for studying

Uploaded by

celeanasard22
Copyright
© All Rights Reserved

Data Frames:

● Tabular Structure: data stored in rows and columns
● Homogeneous Columns: every value within a single column has the same data type (e.g. all numeric or all text); different columns can hold different types.
Misc Stats Terms
● Null hypothesis (H0): the default claim (usually "no effect" or "no difference"), assumed true until the data give enough evidence against it
● Alternative hypothesis (Ha): the competing claim to H0; you look for evidence against H0 in order to support Ha
● Standard Deviation (SD, σ): tells you how spread out your numbers are.
○ Higher SD = more variance in numbers/numbers more spread out
○ Lower SD = less variance in numbers/numbers less spread out
● Degrees of freedom: the number of values that are free to vary independently once any estimated quantities are fixed
○ e.g. a one-sample t-test on n values has n - 1 degrees of freedom, because the sample mean has already been estimated from the same data
● Mean: the average of your numbers.
○ Sample mean (X̄ ): mean of a subset/sample of population
○ Population mean (µ): mean of total population
● Sample size (n) versus population (N)
● Median: The middle number in your list when they're all in order.
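● Quartiles: the three cut points (Q1, Q2, Q3) that split sorted numerical data into four equal parts. Q2 is the median.
● Range: The difference between the biggest and smallest numbers.
● Interquartile Range (IQR): Q3 - Q1, i.e. how spread out the middle half of your numbers is. Often used to find outliers in data (a common rule flags points more than 1.5 x IQR below Q1 or above Q3)
● Box Plot: A visual way of showing your data's median, quartiles and range, and where most of it is.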
● Variance: the expected value of the squared deviation from the mean of a random variable
● Covariance: measure of the relationship between two random variables. The metric
evaluates how much the variables change together
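● Correlation Coefficient: A number between -1 and 1 that says how strongly two sets of numbers are related.
○ Pearson Correlation: measures the strength of a linear relationship; parametric, requires some distributional assumptions
○ Spearman Correlation: not parametric; based on ranks, so it only assumes a monotonic relationship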
● Monte Carlo Simulation Methods: Empirically reproduces the experiment LOTS of times
and simulates the results to obtain an understanding of the process distribution. Uses a
random number generator.
○ Purpose → when exact solution or a closed form to a problem not available, likely
due to (1) the complexity of the problem/the degree of computation OR (2) the
distribution of the statistic of interest isn’t known/can’t be derived simply.
○ Procedure
1. Describe the experiment + all of its possible outcomes
2. Characterize probabilities associated with each outcome
3. Match these probabilities up with what is produced by some random
number generator
4. Generate a large number of random experiments according to this rule
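The four-step procedure above can be sketched in a few lines of Python. This is only an illustration, not something from the course: the two-dice experiment, the function name `simulate_two_dice_sum` and the trial count are all assumptions picked because the exact answer (6/36) is known and easy to check against.

```python
import random

def simulate_two_dice_sum(target: int, n_trials: int, seed: int = 0) -> float:
    """Estimate P(sum of two fair dice == target) by simulation.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # seeded generator so the run is repeatable
    hits = 0
    for _ in range(n_trials):
        # Steps 1-3: each die has outcomes 1..6 with probability 1/6 each,
        # which is exactly what randint(1, 6) produces.
        roll = rng.randint(1, 6) + rng.randint(1, 6)
        if roll == target:
            hits += 1
    # Step 4: the fraction of hits over many trials approximates the probability.
    return hits / n_trials

estimate = simulate_two_dice_sum(target=7, n_trials=100_000)
exact = 6 / 36  # six of the 36 equally likely outcomes sum to 7
print(f"simulated: {estimate:.4f}   exact: {exact:.4f}")
```

With a known exact answer the simulation is overkill, but the same loop still works when no closed form is available.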
Distribution types
● Probability Distributions: Tells you the chances of each possible outcome happening.
● Binomial Distribution: the number of "successes" in a fixed number of independent tries, each with the same chance of success.
● Bernoulli Distribution: a single try with just two outcomes, like a coin flip.
● Poisson Distribution: Counting the number of events in a specific time or space.
● Continuous Distribution: A smooth curve instead of separate points.
● Exponential Distribution: Figuring out how long until something happens.
● Normal or z-Distribution: The bell-shaped curve, when you have sufficient data
○ X̄ (x bar): sample mean used to estimate µ
○ µ (mu): true population parameter
○ σ (sigma): standard dev
● t-Distribution: like a z distribution, but used when the sample is small and the population SD has to be estimated from the data.
○ Has thicker (heavier) tails than the normal curve
○ Varies in shape depending on degrees of freedom (df).
○ The larger the sample size, the less heavy the tails, and the more the curve approximates a normal distribution
● Chi-squared Distribution: the distribution used by chi-squared tests, e.g. to examine whether categorical variables in a random sample are independent
● F-Distribution: probability distribution of the F Statistic
○ F-statistic: a ratio of two variances, used to determine whether the variances of two normal populations are similar to one another.
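To connect the Bernoulli and binomial entries above: a binomial count is just the number of successes across n independent Bernoulli tries. The sketch below checks that by simulation against the standard binomial formula. The values n = 10 and p = 0.5 and the helper names are assumptions chosen for illustration.

```python
import math
import random

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Exact P(k successes in n tries), each try succeeding with probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def simulate_binomial(n: int, p: float, n_experiments: int, seed: int = 0) -> list:
    """Repeat n Bernoulli tries many times; return the share of runs with k successes."""
    rng = random.Random(seed)
    counts = [0] * (n + 1)
    for _ in range(n_experiments):
        # One Bernoulli try is "success" when a uniform draw falls below p.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        counts[successes] += 1
    return [c / n_experiments for c in counts]

n, p = 10, 0.5  # illustrative values, e.g. 10 fair coin flips
simulated = simulate_binomial(n, p, n_experiments=50_000)
exact = [binomial_pmf(k, n, p) for k in range(n + 1)]
for k in range(n + 1):
    print(f"k={k:2d}  simulated={simulated[k]:.4f}  exact={exact[k]:.4f}")
```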
Tests
● Hypothesis Testing: Checking if your ideas about a group of numbers are right.
○ Hypothesis testing using a z-test → involves comparing the means of two samples to determine if there is a significant difference between them.
● Permutation Tests: A way of checking if two groups of numbers are really different, by reshuffling the observed data many times.
○ Can do permutation tests on correlations to test if variables are correlated
○ Permutations refer to the different ways you can arrange a set of items in a specific order
● P-value (probability value): the probability of seeing results at least as extreme as yours if H0 were true. If p val is low, H0 must go
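● T-tests: like a z-test, but performed when you have fewer than 30 observations (or the population SD is unknown)
○ One-sample t-test: compares one sample mean with a fixed reference value
○ Two-sample t-test: compares the means of two independent groups
○ Two-sample paired t-test: compares two measurements taken on the same subjects (e.g. before/after)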
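As a rough sketch of the two-sample z-test described above, the snippet below computes a z statistic and a two-sided p-value using only the standard library. Everything specific here is an assumption for illustration: the function name `two_sample_z_test`, the randomly generated groups, and the use of sample SDs in place of known population SDs (a common large-sample shortcut).

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def two_sample_z_test(a, b):
    """Return (z, two-sided p-value) for H0: both groups have the same mean.

    Assumes both samples are large enough (roughly 30+) that sample SDs
    can stand in for the population SDs.
    """
    standard_error = sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(a) - mean(b)) / standard_error
    # Two-sided: probability of a z at least this far from 0 in either tail.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up data: two groups of 50 draws with different true means.
rng = random.Random(7)
group_a = [rng.gauss(100, 15) for _ in range(50)]
group_b = [rng.gauss(110, 15) for _ in range(50)]

z, p_value = two_sample_z_test(group_a, group_b)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value here would be read as evidence against H0 (equal means), following the "if p val is low, H0 must go" rule.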
Theorems
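● Chebyshev’s Theorem: bounds the dispersion of any data population: for any k > 1, at least 1 - 1/k^2 of the values lie within k standard deviations of the mean, whatever the distribution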
● Central Limit Theorem (CLT): If you take many samples and average each one, those averages will look like a bell curve.
○ Regardless of the shape of the population distribution, if I take a very large number of samples of size n, calculate their means and plot those means:
■ the distribution of the means is approximately normal (closer to normal as n grows)
■ the mean of the mean distribution is the population mean (µ)
■ the SD of this distribution is the population SD divided by the square root of the sample size (σ/√n)
■ This is called the standard error of the mean
○ The CLT allows us to safely treat sample means as normally distributed as long as:
■ Sample size is adequate (typically >30 subjects)
■ We can estimate population parameters based on sample statistics
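The CLT claims above can be checked with a small simulation. This sketch assumes an exponential population with rate 1 (mean 1, SD 1, heavily skewed), samples of size 40 and 5,000 repetitions; all of those choices are illustrative, not from the notes.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(42)
n = 40             # size of each sample (assumed, above the usual ~30 rule of thumb)
n_samples = 5_000  # number of repeated samples

# Population: exponential with rate 1, so mean = 1 and SD = 1, and clearly not normal.
sample_means = [
    mean([rng.expovariate(1.0) for _ in range(n)])
    for _ in range(n_samples)
]

mean_of_means = mean(sample_means)
sd_of_means = stdev(sample_means)
predicted_se = 1.0 / sqrt(n)  # population SD divided by sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean is 1)")
print(f"SD of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {predicted_se:.3f})")
```

Plotting a histogram of `sample_means` would show the bell shape, even though the individual draws are skewed.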

Examples
● Calculation of Standard Dev (SD)
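● Researcher asks a population of 5 ppl to report how many apples they eat per day. The distribution is 1, 1, 3, 5, 2. What’s the average? Standard dev?
○ Mean = sum(1,1,3,5,2)/5 = 12/5 = 2.4
○ SD = sqrt(sum((Xi-2.4)^2)/(5-1)) = sqrt(11.2/4) ≈ 1.67 (sample formula; dividing by N = 5 instead gives the population SD, sqrt(11.2/5) ≈ 1.50)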

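● Pay attention: the formulas for the standard dev of a population versus a sample are different!
○ Population Standard Dev
■ Variance = sum((Xi-µ)^2)/N, and SD (σ) = sqrt(Variance)
○ Sample Standard Dev
■ Variance = sum((Xi-X̄)^2)/(n-1), and SD (s) = sqrt(Variance)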
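To double-check the apples example against both formulas, Python's built-in `statistics` module has separate functions for the population versions (`pvariance`, `pstdev`) and the sample versions (`variance`, `stdev`). The variable names below are just for illustration.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

apples = [1, 1, 3, 5, 2]  # apples per day from the example

avg = mean(apples)             # 12 / 5
pop_var = pvariance(apples)    # sum of squared deviations (11.2) divided by N = 5
pop_sd = pstdev(apples)        # square root of the population variance
sample_var = variance(apples)  # same 11.2 divided by n - 1 = 4
sample_sd = stdev(apples)      # square root of the sample variance

print(f"mean = {avg}")
print(f"population: variance = {pop_var:.2f}, SD = {pop_sd:.2f}")
print(f"sample:     variance = {sample_var:.2f}, SD = {sample_sd:.2f}")
```

The sample SD is always a bit larger than the population SD on the same numbers, because it divides by n - 1 instead of N.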

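● SAT by GPA (permutation test on a correlation).
● 15 pairs of (x, y). x = SAT scores, y = GPA. H0: ρ = 0, where ρ is the population correlation and r is the sample correlation
1. Permute pairs: (x1, y1), (x2, y2), …, (xn, yn)
a. 1,000, 10,000, etc. permutations…
2. Sampling without replacement: shuffle the order of the x's, e.g. x = (15, 14, 13, …)
a. Only the x's are reshuffled; the y's keep their original order, because the test works by breaking the original (x, y) pairing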
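Here is one way the SAT/GPA permutation test could look in code. The 15 pairs below are invented purely so the example runs (they are not real data and not from the course), and the function names are hypothetical. The idea: compute r on the real pairs, then reshuffle the x's many times and count how often the shuffled |r| is at least as large as the observed one.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def permutation_test_correlation(xs, ys, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for H0: no correlation between xs and ys."""
    rng = random.Random(seed)
    observed = pearson_r(xs, ys)
    shuffled = list(xs)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # reorder x without replacement; y stays put
        if abs(pearson_r(shuffled, ys)) >= abs(observed):
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_permutations

# Invented illustration data: 15 (SAT, GPA) pairs.
sat = [1010, 1060, 1100, 1140, 1180, 1210, 1250, 1290,
       1320, 1360, 1400, 1430, 1470, 1510, 1550]
gpa = [2.4, 2.9, 2.6, 3.0, 2.8, 3.1, 3.3, 3.0,
       3.4, 3.2, 3.6, 3.5, 3.7, 3.6, 3.9]

observed_r, p_value = permutation_test_correlation(sat, gpa)
print(f"observed r = {observed_r:.3f}, permutation p-value = {p_value:.4f}")
```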

Common questions


Chebyshev’s Theorem provides a conservative estimate of data dispersion, indicating that for any k > 1, at least (1 - 1/k^2) of the data values must lie within k standard deviations of the mean. This theorem is applicable regardless of the data's distribution, making it a powerful tool for assessing the extent of variability in a population.
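A quick way to see the bound in action is to compare it with the actual share of values within k standard deviations on some deliberately non-normal data. The skewed (exponential) data set and the helper name `fraction_within` below are assumptions made up for this sketch.

```python
import random
from statistics import mean, pstdev

def fraction_within(data, k):
    """Share of values lying within k population SDs of the mean."""
    mu, sigma = mean(data), pstdev(data)
    inside = sum(1 for x in data if abs(x - mu) <= k * sigma)
    return inside / len(data)

# Deliberately skewed, non-normal data (illustrative only).
rng = random.Random(1)
skewed = [rng.expovariate(1.0) for _ in range(10_000)]

results = {}
for k in (1.5, 2, 3):
    bound = 1 - 1 / k**2  # Chebyshev's guaranteed minimum
    results[k] = (fraction_within(skewed, k), bound)
    print(f"k={k}: actual share = {results[k][0]:.3f}, Chebyshev bound = {bound:.3f}")
```

The actual share is usually well above the bound, which is why the theorem is called conservative.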

Degrees of freedom in the t-distribution are crucial as they determine the shape of the distribution. They account for the number of values in a data set that can vary independently. As the degrees of freedom increase, the t-distribution resembles the normal distribution more closely, making it critical for smaller sample sizes where the normality assumption is less valid.

The Central Limit Theorem states that the sampling distribution of the sample mean will approach a normal distribution as the sample size becomes large, regardless of the population's distribution shape. This implies that means derived from a sufficiently large sample can be assumed to be normally distributed, allowing for standard inferential statistical procedures.

The p-value in hypothesis testing quantifies the probability of observing results as extreme as, or more extreme than, those observed, assuming the null hypothesis is true. A low p-value means the observed data would be unlikely if the null hypothesis were true, which counts as evidence against it; it is not the probability that the null hypothesis is true. It is pivotal for determining the statistical significance of findings.

The binomial distribution models discrete variables by representing the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success. Its limitations include an assumption of independence among trials and a fixed probability of success, which might not hold in real-world scenarios where probabilities can change.

Variance measures the average squared deviation from the mean, providing a quantitative view of the data's spread. It is directly related to standard deviation since the standard deviation is the square root of the variance, thus expressing variability in the same units as the data.

Monte Carlo simulations employ algorithmic techniques to reproduce an experiment numerous times to approximate the distribution of possible outcomes. The procedure involves describing the experiment and outcomes, assigning probabilities, matching these with random numbers, and then repeating random experiments many times. This method is particularly useful when the exact solution to a problem is complex or unknown.

Permutation tests are preferred when the data does not meet the assumptions required for traditional parametric tests, such as normality and equal variances. They are appropriate for small sample sizes or when the distribution is unknown, as they are based on the rearrangement of observed data rather than assumptions about population distributions.

The correlation coefficient is a statistical measure that describes the degree to which two variables move in relation to each other. Pearson correlation measures the linear relationship between two variables; significance tests on it typically assume the data are roughly normally distributed. In contrast, Spearman correlation is non-parametric and does not assume a normal distribution; it assesses how well the relationship between two variables can be described using a monotonic function.
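The difference shows up clearly on data that is monotonic but not linear. In this sketch Spearman is computed the usual way, as the Pearson correlation of the ranks; the helper names and the y = x^3 data are assumptions chosen for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = average_rank
        i = j + 1
    return result

def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

x = list(range(1, 11))
y = [v**3 for v in x]  # always increasing, but far from a straight line

pearson = pearson_r(x, y)
spearman = spearman_rho(x, y)
print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

Spearman treats any consistently increasing relationship as perfect, while Pearson is pulled below 1 by the curvature.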

The F-distribution is chiefly used to compare the variances between two populations to determine if they are equal, commonly within the context of an ANOVA framework. It requires two sets of degrees of freedom. Meanwhile, the Chi-squared distribution is generally used to test the independence between categorical variables and is extremely sensitive to sample size.
