Mech 262: Data Analysis Fundamentals

Uploaded by

sisitrash

Introduction

Terminology:
● Population: Entire data set
● Sample: Subset of population
● Sample space: Set of all possible outcomes
● Discrete: Countable number of possible values
● Continuous: Infinitely many possible values within a range
● Random/stochastic variable: Number assigned to each outcome to identify it
● Distributions
○ Symmetric
○ Uniform
○ Bimodal
○ Skewed
○ J-Shaped
● Stochastic process
○ Random process
Describing data set:
● Central tendency
○ Mean
○ Median
○ Mode
● Dispersion
○ Standard deviation (root-mean-square deviation from the mean)


Normal distribution:
● Also called gaussian distribution or bell curve
● Fully described by its mean (µ) and standard deviation (σ)


Probability axioms:
● Probability: Likelihood that event will happen
● Axiom 1: probability is between 0 and 1
● Axiom 2: P=1 means event must happen
● Axiom 3: probabilities of all possible outcomes sum to 1
Probability rules:
● Mutually exclusive
○ Events cannot occur at same time
○ RULE: 𝑃(𝐴∪𝐵) = 𝑃(𝐴) + 𝑃(𝐵)
○ eg. Flipping coin and getting heads or tails
● Mutually inclusive
○ Two events may or may not occur together
○ RULE: 𝑃(𝐴∪𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴∩𝐵)
■ So we don’t double count overlap
● Independent events
○ Outcome of A doesn’t influence B
○ RULE: 𝑃(𝐴∩𝐵) = 𝑃(𝐴) × 𝑃(𝐵)
○ eg. Flipping two coins separately
● Dependent events:
○ Also called conditional probabilities
○ Probability of A given B happening
○ Denoted: 𝑃(𝐴|𝐵)
○ RULE: 𝑃(𝐴∩𝐵) = 𝑃(𝐵) 𝑃(𝐴|𝐵)
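The rules above can be checked on a small example. This is a minimal sketch in Python (not the course's MATLAB), assuming one roll of a fair six-sided die; the events A and B and the helper `prob` are made up for illustration.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (assumed example).
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a set of outcomes) when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}  # roll is even
B = {4, 5, 6}  # roll is greater than 3

# Mutually inclusive rule: P(A or B) = P(A) + P(B) - P(A and B)
p_union = prob(A) + prob(B) - prob(A & B)

# Conditional probability, rearranged from P(A and B) = P(B) P(A|B)
p_A_given_B = prob(A & B) / prob(B)

# A and B are independent only if P(A and B) == P(A) * P(B)
independent = prob(A & B) == prob(A) * prob(B)

print(p_union, p_A_given_B, independent)  # 2/3 2/3 False
```

Here A and B overlap (outcomes 4 and 6), so the overlap term is needed, and knowing B changes the probability of A, so the events are dependent.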
Probability distribution:
● Probability mass functions (PMF)
○ For discrete random variables
○ Mean: µ = Σ𝑥𝑖 𝑃(𝑥𝑖)
○ Variance: σ² = Σ(𝑥𝑖 − µ)² 𝑃(𝑥𝑖)
● Probability density function (PDF)
○ Since infinite number of outcomes, probability of given outcome is 0
■ ∴intervals must be used
○ Use integral instead of sums
○ 𝑃(𝑎 ≤ 𝑥 ≤ 𝑏) = ∫𝑓(𝑥) 𝑑𝑥, integrated from 𝑎 to 𝑏
○ Mean: µ = ∫𝑥 𝑓(𝑥) 𝑑𝑥
○ Variance: σ² = ∫(𝑥 − µ)² 𝑓(𝑥) 𝑑𝑥
● Cumulative distribution function (CDF)
○ Cumulative probability up to a point: 𝐹(𝑥) = 𝑃(𝑋 ≤ 𝑥)
○ Use probability function to determine probability of event in certain range
■ Adjust bounds of integration
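As a small worked case of the discrete formulas, the sketch below computes the mean, variance, and a range probability from a PMF. It is written in Python and assumes a fair six-sided die as the random variable; the names `pmf` and `cdf` are illustrative only.

```python
from fractions import Fraction

# PMF of a fair six-sided die (assumed example): each face has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# Mean: mu = sum of x_i * P(x_i)
mean = sum(x * p for x, p in pmf.items())

# Variance: sigma^2 = sum of (x_i - mu)^2 * P(x_i)
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

def cdf(x):
    """Cumulative probability P(X <= x), summing the PMF up to x."""
    return sum(p for xi, p in pmf.items() if xi <= x)

# Probability of a range, P(2 < X <= 5), as a difference of CDF values.
p_range = cdf(5) - cdf(2)

print(mean, variance, p_range)  # 7/2 35/12 1/2
```

For a continuous variable the sums become integrals of the PDF, but the structure of the calculation is the same.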

Discrete probability distributions


● Types
○ Binomial distribution
○ Poisson distribution
Choose notation:
● Permutations
○ Denoted: nPk
○ Order matters
○ Number of unique ordered sets
● Combinations
○ Denoted: nCk
○ Order independent
○ Number of unique non-ordered sets
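A quick numerical illustration of the two counts, sketched in Python with the standard `math` module. The formulas nPk = n!/(n − k)! and nCk = n!/(k!(n − k)!) are the usual definitions; the values n = 5 and k = 3 are assumed for the example.

```python
import math

n, k = 5, 3  # choose 3 items out of 5 (assumed example values)

# Permutations: order matters, nPk = n! / (n - k)!
n_perm = math.perm(n, k)

# Combinations: order does not matter, nCk = n! / (k! (n - k)!)
n_comb = math.comb(n, k)

# Each unordered set of k items can be arranged in k! different orders.
assert n_perm == n_comb * math.factorial(k)

print(n_perm, n_comb)  # 60 10
```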
Binomial distribution:
● Used for only 2 possible outcomes
○ Eg. pass or fail
○ Probability of a certain outcome occurring k times if the process is repeated n times
● 𝑃(𝑘) = nCk · 𝑝^𝑘 · (1 − 𝑝)^(𝑛−𝑘) where p is probability of desired event, k is # of desired events, n is total number of events
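A minimal sketch of the binomial calculation in Python, assuming the standard binomial formula P(k) = nCk · p^k · (1 − p)^(n − k). The function name `binomial_pmf` and the coin-flip numbers are illustrative assumptions.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k desired outcomes in n independent trials,
    each with probability p of the desired outcome."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Assumed example: 10 fair coin flips, exactly 3 heads.
p_three_heads = binomial_pmf(3, 10, 0.5)

# Probabilities over every possible k (0..n) should sum to 1.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))

print(p_three_heads, total)
```

Summing the function over a range of k gives "at least" or "at most" probabilities.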
Poisson distribution:
● Probability of given events in fixed interval of time or space
● Used for
○ Constant mean rate
○ Events independent of time since last event
○ Time or spatially discrete problems
● Assumptions:
○ Probability of event is independent of time interval (given same length of time)
○ Probability of event is independent of other events
● 𝑃(𝑘) = λ^𝑘 𝑒^(−λ) / 𝑘!
where k is number of times the desired event occurs, λ is the mean (expected) number of events in the time interval of interest
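The Poisson formula P(k) = λ^k e^(−λ) / k! can be evaluated directly. The Python sketch below assumes an average of 2 events per interval; the function name `poisson_pmf` is made up for the example.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events in an interval where lam events are expected."""
    return lam**k * math.exp(-lam) / math.factorial(k)

lam = 2.0  # assumed mean number of events per interval

# Probability of no events in the interval: e^(-lam)
p_zero = poisson_pmf(0, lam)

# Probability of at most 3 events: sum the PMF for k = 0..3
p_at_most_3 = sum(poisson_pmf(k, lam) for k in range(4))

print(p_zero, p_at_most_3)
```

If the interval of interest changes length, λ is rescaled in proportion, since the mean rate is assumed constant.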

Continuous probability distributions


Types:
● Normal/gaussian distribution
● Standard log-normal distribution
● Exponential distribution
Normal/gaussian distribution:
● Symmetric; values can be negative
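● Models random error
● Probability over a range: 𝑃(𝑥1 ≤ 𝑥 ≤ 𝑥2) = ∫ 1/(σ√(2π)) · 𝑒^(−(𝑥−µ)²/(2σ²)) 𝑑𝑥, integrated from 𝑥1 to 𝑥2
○ But hard to solve so use change of variables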
○ 𝑧 = (𝑥 − µ)/σ
■ How many standard deviations away from mean
○ Set standard integral:
■ 𝑃(𝑧1 ≤ 𝑧 ≤ 𝑧2) = 1/√(2π) · ∫𝑒^(−𝑧²/2) 𝑑𝑧, integrated from 𝑧1 to 𝑧2
■ Set z1=0 to make it a single variable function
■ Use tables
● Using matlab to find probability
○ p=normcdf(z) where z is change of variable OR
■ From -∞ to point (NOT 0)
○ p=normcdf(x, μ, σ)
■ From -∞ to point (NOT 0)
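The same lookup can be sketched in Python with the standard library's `statistics.NormalDist`, whose `cdf` method, like a cumulative normal function, integrates from −∞ up to the given point. The mean, standard deviation, and limits below are assumed example values.

```python
from statistics import NormalDist

mu, sigma = 10.0, 2.0  # assumed mean and standard deviation
x1, x2 = 8.0, 12.0     # assumed range of interest (mu +/- 1 sigma)

# Route 1: work directly in x. cdf(x) is the area from -infinity to x,
# so a range probability is the difference of two cdf values.
dist = NormalDist(mu, sigma)
p_direct = dist.cdf(x2) - dist.cdf(x1)

# Route 2: change of variables z = (x - mu) / sigma, then use the standard normal.
z1 = (x1 - mu) / sigma
z2 = (x2 - mu) / sigma
standard = NormalDist()  # mean 0, standard deviation 1
p_from_z = standard.cdf(z2) - standard.cdf(z1)

print(p_direct, p_from_z)  # both approximately 0.683
```

Tables that start at z = 0 give the area from 0 to z instead, so 0.5 has to be added to match a cumulative value taken from −∞.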
Standard lognormal distribution:
● Strictly positive and occasionally very large
○ eg. Lifetime of equipment
● Variable whose logarithm is normally distributed
○ Take ln of variable, then apply the normal distribution
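The "take ln, then use the normal distribution" step can be sketched as follows in Python. The parameters `mu_ln` and `sigma_ln` are assumed to be the mean and standard deviation of ln(x), and the helper name `lognormal_cdf` is hypothetical.

```python
import math
from statistics import NormalDist

def lognormal_cdf(x, mu_ln, sigma_ln):
    """P(X <= x) for x > 0, where ln(X) is normal with mean mu_ln and std sigma_ln."""
    return NormalDist(mu_ln, sigma_ln).cdf(math.log(x))

# Assumed example: ln(X) is standard normal (mean 0, standard deviation 1).
p_below_one = lognormal_cdf(1.0, 0.0, 1.0)   # ln(1) = 0, the median of ln(X)
p_below_ten = lognormal_cdf(10.0, 0.0, 1.0)  # large values remain possible

print(p_below_one, p_below_ten)
```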
Exponential distribution
● Probability density decreases exponentially with x (e.g. time between events that occur at a constant rate)
● PDF function: 𝑓(𝑥, λ) = λ𝑒^(−λ𝑥) where λ is rate parameter
● Mean (μ) = standard deviation (σ) = 1/λ
● CDF: 𝑃(𝑥1 ≤ 𝑥 ≤ 𝑥2) = 𝑒^(−λ𝑥1) − 𝑒^(−λ𝑥2)
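A short Python sketch of the range formula P(x1 ≤ x ≤ x2) = e^(−λ x1) − e^(−λ x2). The rate of 0.5 events per unit time is an assumed value and `exponential_prob` is an illustrative name.

```python
import math

def exponential_prob(x1, x2, lam):
    """P(x1 <= x <= x2) for an exponential distribution with rate lam (0 <= x1 <= x2)."""
    return math.exp(-lam * x1) - math.exp(-lam * x2)

lam = 0.5          # assumed rate parameter
mean = 1.0 / lam   # mean and standard deviation are both 1/lam

# Probability that the event happens before the mean time: 1 - e^(-1)
p_before_mean = exponential_prob(0.0, mean, lam)

# Probability that it takes longer than three mean times: e^(-3)
p_after_3_means = exponential_prob(3 * mean, math.inf, lam)

print(p_before_mean, p_after_3_means)
```

About 63% of the probability lies below the mean, which reflects how strongly the distribution is skewed toward small values.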

Test for normality:


● Test how normal an experimental distribution is
● 2 measures
○ Skewness: asymmetry of data
■ Denoted S
■ 𝑆 = 0 is perfectly symmetric
■ |𝑆| > 1 ⇒ very skewed
○ Kurtosis: sharpness of data (also indication of tails/extremes)
■ Denoted K
■ 𝐾 = 3 is perfect normal distribution
■ note: Some programs subtract 3 automatically
■ |𝐾 − 3| > 3 ⇒ very sharp/blunt
● Standardized moment chart to describe data relative to normal distribution
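Both measures are standardized moments (third and fourth), so they can be computed with one helper. The Python sketch below uses the population form (dividing by n); statistical packages may apply bias corrections or subtract 3 from the kurtosis, so results can differ slightly. The data set is made up for illustration.

```python
def standardized_moment(data, order):
    """Central moment of the given order divided by sigma**order (population form)."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n
    central = sum((x - mean) ** order for x in data) / n
    return central / variance ** (order / 2)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # assumed sample, mean 5 and sigma 2

S = standardized_moment(data, 3)  # skewness: 0 for perfectly symmetric data
K = standardized_moment(data, 4)  # kurtosis: 3 for a perfect normal distribution

print(S, K)
```

For this sample S is positive (a longer tail to the right) and K is slightly below 3.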

Population and samples


Distribution parameter estimation:
● Impractical to measure the entire population
○ ∴ usually measure only a sample
● How do we determine if our sample mean accurately represents the population mean?
● Central limit theorem
○ Distribution of sample means is normally distributed
■ For n>30 (#of data points per sample)
■ Regardless of population distribution
○ ↗ # of data points (n) per sample → ↘ standard deviation of the sample means
○ Properties
■ If population is normally distributed → sample mean (x̄) is normally distributed
■ If # of data points per sample (n) > 30 → sample mean (x̄) is normally distributed

■ If n>30 →
Interval estimation of mean:
● Determining error in our sample mean
○ Population mean is estimated as µ = x̄ ± δ
○ δ is the half-width of the confidence interval
○ Standard is the 95% confidence level (ASME standard)
● Confidence level (C)
○ Probability that population mean (μ) lies within confidence interval
○ 𝐶 = 1 − α where α is level of significance
○ C is % chance event will happen
○ α is % chance event will not happen
● Assume standard deviation of sample is approximately equal to standard deviation of population
○ 𝑆 ≈ σ
● δ = 𝑧α/2 · 𝑆 / √𝑛 where α is significance
○ To find zα/2 reverse the z process using tables
○ Using matlab
■ z = norminv(p)
■ Links probability to z value
○ z1 is at norminv(α/2) AND z2 is at norminv(C+α/2)
● One-sided intervals
○ Only interested in upper or lower limit

■ Upper: µ ≤ x̄ + 𝑧α · 𝑆 / √𝑛
■ Lower: µ ≥ x̄ − 𝑧α · 𝑆 / √𝑛
○ DON’T divide α by 2 since all area (probability) is on one side
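The interval calculation can be sketched in Python, where `NormalDist().inv_cdf` from the standard library plays the role of an inverse normal lookup (probability in, z value out). The sketch assumes δ = z · S / √n with S ≈ σ and n > 30; the sample statistics below are invented for illustration.

```python
import math
from statistics import NormalDist

# Assumed sample statistics (n > 30 so the normal distribution applies).
n = 36
x_bar = 50.0   # sample mean
S = 6.0        # sample standard deviation, taken as an estimate of sigma

C = 0.95       # confidence level
alpha = 1 - C  # level of significance

standard = NormalDist()

# Two-sided interval: alpha is split between both tails, so look up C + alpha/2.
z_two_sided = standard.inv_cdf(C + alpha / 2)
delta = z_two_sided * S / math.sqrt(n)
interval = (x_bar - delta, x_bar + delta)

# One-sided upper limit: all of alpha sits in one tail, so do NOT divide by 2.
z_one_sided = standard.inv_cdf(1 - alpha)
upper_limit = x_bar + z_one_sided * S / math.sqrt(n)

print(interval, upper_limit)
```

With these numbers z is about 1.96 for the two-sided case and about 1.645 for the one-sided case, so the one-sided limit sits closer to the sample mean.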
Student’s t-distribution:
● Use when n<30
● Same procedure as normal distribution BUT
○ Use t instead of z
○ Matlab:
■ tinv(p, nu) where 𝑝upper = 𝐶 + α/2
■ ν (nu) is degrees of freedom (ν = n − 1)
● As ν→∞, distribution approaches normal distribution
● As ν decreases, distribution flattens and widens
Estimation of population variance:
● Use chi-squared (χ²) distribution
● Use matlab
○ chi2inv(p, nu)
● χ² is only positive therefore bounds are:
○ 𝑝1 = α/2
○ 𝑝2 = 𝐶 + α/2

Correlation
Linear correlation:
● Linear correlation coefficient (𝑟xy)
○ 𝑟 = 1 → perfect positive correlation
○ 𝑟 = −1 → perfect negative correlation
○ 𝑟 ≈ 0 (within about ±0.1) → no linear correlation
● Only provides data on correlation
○ NO slope
○ NO non-linear correlation
● Matlab:
○ corr(x, y) where x and y are arrays of values
● Significance of linear correlation coefficient
○ ↗ data points → ↗ significance
○ Table gives minimum correlation coefficient needed to accept correlation
○ Depends on
■ #of data points sampled
■ Significance level wanted (α) (% chance that correlation is due to pure chance)
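To make the coefficient concrete, the Python sketch below computes the usual Pearson form r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²) by hand. The function name `pearson_r` and both data sets are assumptions chosen for illustration.

```python
import math

def pearson_r(x, y):
    """Linear (Pearson) correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]

# Perfectly linear data: r is 1 whatever the slope, so r carries no slope information.
y_line = [2.0 * a + 1.0 for a in x]
r_line = pearson_r(x, y_line)

# Perfect but non-linear (parabolic) relation: r is 0 even though y depends on x.
y_parabola = [(a - 3.0) ** 2 for a in x]
r_parabola = pearson_r(x, y_parabola)

print(r_line, r_parabola)
```

The second case shows why a coefficient near zero only rules out a linear relationship.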

Correlation and causation:


● Correlation does NOT imply causation
