BT1101 Business Analytics Cheatsheet

Q: What are the assumptions and limitations of using the ANOVA test for comparing group means?

ANOVA assumes independent, randomly obtained observations, normally distributed data within each group, and homogeneity of variance across groups. When these hold, the ratio of between-group to within-group variance follows an F-distribution under the null hypothesis that all population means are equal. ANOVA is fairly robust to departures from normality, and unequal variances have little effect when group sizes are equal. Its main weakness is unequal variances combined with unequal sample sizes, which can distort the Type I error rate; in that case an alternative such as Welch's ANOVA is preferable. A significant F only shows that at least one mean differs, so post-hoc pairwise comparisons are needed to find which groups differ.
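To make the variance comparison concrete, here is a minimal Python sketch of the one-way ANOVA F statistic computed by hand. The function name `one_way_anova_f` and the three toy groups are illustrative assumptions, not part of the original notes, and the sketch stops at the F statistic rather than a p-value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Toy data (assumed): three groups of equal size.
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f_stat, df_b, df_w)
```

With these groups the between-group mean square is 3 and the within-group mean square is 1, so F is 3 on (2, 6) degrees of freedom. A statistics package would then compare that value against the F-distribution to obtain a p-value.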

Q: Describe the use and implications of the shadow price in the context of linear optimization problems.

The shadow price, or dual value, is the change in the optimal objective value per unit increase in the right-hand side of a constraint, with everything else held constant. It quantifies the benefit of slightly relaxing a binding constraint; a constraint that is not binding has a shadow price of zero. A positive shadow price means the objective value rises as the right-hand side increases, and a negative one means it falls; whether that is an improvement depends on whether the problem maximises or minimises. The value holds only for small changes that leave the same constraints binding. Shadow prices help managers prioritise which resources to expand and inform budget allocation.
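As a rough illustration, the sketch below estimates shadow prices for a tiny two-variable problem by re-solving it with one right-hand side raised by one unit. The helper names `solve_lp_2d` and `shadow_price`, the corner-point enumeration, and the example numbers are all assumptions made for this sketch; a real solver would report dual values directly.

```python
from itertools import combinations

def solve_lp_2d(c, constraints):
    """Maximise c[0]*x + c[1]*y subject to a*x + b*y <= r for each (a, b, r),
    plus x >= 0 and y >= 0, by checking every corner point."""
    # Express the non-negativity bounds as -x <= 0 and -y <= 0.
    rows = list(constraints) + [(-1.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
    best = None
    for (a1, b1, r1), (a2, b2, r2) in combinations(rows, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < 1e-12:
            continue  # parallel boundaries never intersect
        x = (r1 * b2 - r2 * b1) / det
        y = (a1 * r2 - a2 * r1) / det
        if all(a * x + b * y <= r + 1e-9 for a, b, r in rows):
            value = c[0] * x + c[1] * y
            if best is None or value > best:
                best = value
    return best

def shadow_price(c, constraints, index, step=1.0):
    """Change in the optimal objective when constraint `index` gains `step` units."""
    base = solve_lp_2d(c, constraints)
    relaxed = list(constraints)
    a, b, r = relaxed[index]
    relaxed[index] = (a, b, r + step)
    return (solve_lp_2d(c, relaxed) - base) / step

# Assumed example: maximise 3x + 2y with x + y <= 4, x <= 2, y <= 10.
objective = (3.0, 2.0)
constraints = [(1.0, 1.0, 4.0), (1.0, 0.0, 2.0), (0.0, 1.0, 10.0)]
prices = [shadow_price(objective, constraints, i) for i in range(3)]
print(prices)
```

The first two constraints are binding at the optimum and get positive prices, while the slack constraint `y <= 10` gets zero. The one-unit step is valid here only because it leaves the set of binding constraints unchanged.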

Q: How does a contingency table differ from a pivot table in data analysis?

A contingency table displays the number of observations for each combination of categories of two or more categorical variables. The categories must be mutually exclusive and exhaustive, so that together they cover the entire dataset. A pivot table, by contrast, is an interactive tool for building cross-tabulations: it lets you drill down into a large dataset by slicing, filtering, and aggregating it in different ways.
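A small Python sketch shows how a contingency table is just a count of observations per category pair. The records and the variable names (`records`, `table`) are made up for illustration.

```python
from collections import Counter

# Assumed toy data: (gender, purchased) for six customers.
records = [
    ("F", "Yes"), ("F", "No"), ("M", "Yes"),
    ("M", "Yes"), ("F", "Yes"), ("M", "No"),
]

counts = Counter(records)                   # frequency of each category pair
rows = sorted({r for r, _ in records})      # categories of the first variable
cols = sorted({c for _, c in records})      # categories of the second variable

# Nested dict: table[row_category][column_category] -> count
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}
total = sum(sum(row.values()) for row in table.values())

for r in rows:
    print(r, table[r])
print("total", total)
```

Because the categories are mutually exclusive and exhaustive, the cell counts add up to the number of records.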

Q: How do sample size and variability influence the reliability of a confidence interval?

The precision of a confidence interval depends on both sample size and variability. A larger sample size decreases the standard error, narrows the interval, and gives a more precise estimate of the population parameter. High variability increases the standard error, widens the interval, and introduces more uncertainty into the estimate. Reducing variability through careful data collection or by transforming the data can improve precision, while an adequate sample size makes the interval more reliable.
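The effect of sample size and variability can be checked numerically. This sketch computes the margin of error of a z-based interval for a mean, assuming the population standard deviation is known; the function name `z_margin` and the input values are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def z_margin(sd, n, confidence=0.95):
    """Half-width of a z-based confidence interval for a mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # critical value z_{alpha/2}
    return z * sd / sqrt(n)

base = z_margin(sd=10, n=100)
larger_sample = z_margin(sd=10, n=400)   # four times the data
noisier_data = z_margin(sd=20, n=100)    # twice the standard deviation

print(round(base, 3), round(larger_sample, 3), round(noisier_data, 3))
```

Quadrupling the sample size halves the margin, while doubling the standard deviation doubles it.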

Q: How does the F-test evaluate variance equality between two samples, and what assumptions are inherent in its application?

The F-test evaluates whether two population variances differ by comparing the ratio of the sample variances. The larger variance is placed in the numerator to form the test statistic F = s1²/s2², which has two degrees of freedom, one for each sample. The null hypothesis is that the variances are equal. The test assumes that both samples are independent and drawn from normally distributed populations, and it is sensitive to violations of normality. If normality is doubtful, an alternative such as Levene's test, which does not assume normality, can be used to compare the variances.
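The statistic itself is simple to compute. The sketch below, with an assumed helper `f_statistic` and made-up samples, puts the larger sample variance in the numerator and returns the matching degrees of freedom. It does not compute a p-value, which needs the F-distribution from a statistics package.

```python
from statistics import variance

def f_statistic(sample_a, sample_b):
    """Return (F, df_numerator, df_denominator) with the larger variance on top."""
    var_a, var_b = variance(sample_a), variance(sample_b)  # sample variances (n - 1)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1

# Assumed toy samples.
a = [1, 2, 3, 4, 5]            # sample variance 2.5
b = [2, 4, 6, 8, 10, 12]       # sample variance 14
f_value, df_num, df_den = f_statistic(a, b)
print(f_value, df_num, df_den)
```

Here the second sample has the larger variance, so it goes in the numerator and contributes the first degrees-of-freedom value.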

Q: How can hypothesis testing errors such as Type I and Type II occur, and what strategies exist to mitigate them?

A Type I error occurs when the null hypothesis is rejected even though it is true; its probability is α, which the researcher sets in advance. A Type II error occurs when the null hypothesis is not rejected even though it is false; its probability β depends on the sample size and on how far the true parameter is from the hypothesised value. For a fixed sample size, lowering α generally raises β. Strategies to mitigate these errors include choosing α to balance the two risks, increasing the sample size to raise the test's power (1-β), and selecting an appropriate test statistic to improve sensitivity.
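The link between sample size and power can be sketched for the simplest case, an upper-tail z-test with known standard deviation. The function name `power_one_sided_z` and the numbers used are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided_z(mu0, mu_true, sd, n, alpha=0.05):
    """Power (1 - beta) of an upper-tail z-test of H0: mu <= mu0."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha)            # rejection cut-off under H0
    shift = (mu_true - mu0) * sqrt(n) / sd    # true mean's distance in standard errors
    return 1 - nd.cdf(z_crit - shift)

small_n = power_one_sided_z(mu0=100, mu_true=102, sd=10, n=25)
large_n = power_one_sided_z(mu0=100, mu_true=102, sd=10, n=100)
far_mean = power_one_sided_z(mu0=100, mu_true=105, sd=10, n=25)
under_h0 = power_one_sided_z(mu0=100, mu_true=100, sd=10, n=25)

print(round(small_n, 3), round(large_n, 3), round(far_mean, 3), round(under_h0, 3))
```

Power rises with a larger sample and with a true mean further from the hypothesised one. When the true mean equals the hypothesised mean, the rejection probability is simply α.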

Q: What role does the Central Limit Theorem (CLT) play in understanding the sampling distribution of the mean?

The Central Limit Theorem (CLT) states that the sampling distribution of the sample mean approaches a normal distribution as the sample size becomes large, regardless of the shape of the population distribution. Its mean equals the population mean and its standard deviation is the standard error, σ/√n. If the population is itself normal, the sampling distribution is normal for any sample size. This lets statisticians make inferences about population parameters even from non-normal data, and it underpins confidence intervals and hypothesis tests when the population distribution is unknown.
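A short simulation makes the theorem tangible. The sketch draws repeated samples from a strongly skewed exponential population and summarises the resulting sample means; the seed, sample size, and number of trials are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

n, trials = 50, 2000  # sample size and number of repeated samples (assumed values)

# Population: exponential with rate 1, which is right-skewed
# and has mean 1 and standard deviation 1.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(trials)
]

centre = mean(sample_means)       # expected to sit close to the population mean, 1
spread = stdev(sample_means)      # expected to sit close to sigma / sqrt(n)
standard_error = 1.0 / sqrt(n)

print(round(centre, 3), round(spread, 3), round(standard_error, 3))
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the individual observations are far from normal.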

Q: What statistical measures and concepts can help identify outliers in a dataset, and how should they be handled?

Outliers can be identified with the Z-score method, where a Z-score magnitude greater than 3 suggests an outlier, or with the IQR method, where points more than 1.5 times the IQR below Q1 or above Q3 are flagged (more than 3 times the IQR marks an extreme outlier). Once identified, investigate the cause before deciding what to do. Outliers that reflect genuine and important features of the data should be kept, whereas those caused by errors should be corrected or removed. Because the mean and range are sensitive to outliers, use robust measures such as the median or IQR if outliers materially affect the results.
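Both rules are easy to apply in code. In this sketch the function names and the dataset are illustrative assumptions, and the quartiles come from Python's `statistics.quantiles` with its default method; other quartile conventions can move the fences slightly.

```python
from statistics import mean, quantiles, stdev

def z_score_outliers(data, threshold=3.0):
    """Values whose |Z-score| exceeds the threshold (sample sd used)."""
    m, s = mean(data), stdev(data)
    return [x for x in data if abs(x - m) / s > threshold]

def iqr_outliers(data, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

# Assumed toy data with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 50]
print(z_score_outliers(data), iqr_outliers(data))
```

Both methods flag the value 50 here. In small samples the Z-score rule can miss outliers because the outlier itself inflates the standard deviation, which is one reason the IQR rule is often preferred.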

This document provides a cheatsheet summarizing key concepts in business analytics. It defines common terms like data, metrics, measures, and databases. It also outlines the business analytics process of defining problems, analyzing data, interpreting results, and implementing decisions. Statistical and modeling techniques are summarized, including descriptive statistics, probability, experimental design, and statistical tests like ANOVA. Visualization tools like dashboards, pie charts, and other charts are also covered.

Uploaded by

Random Dude

Introduction to Data Analytics
Regression and Forecasting


Final cheatsheet - Everything combined

Introduction to Business Analytics (National University of Singapore)


Data: numerical/textual facts and figures collected through some type of measurement process.
Data Set: a collection of data.
Database: a collection of related files containing records on people, places or things; usually organised in a 2D table, where columns correspond to each individual element of data (fields/attributes) and rows represent records/entities of related data elements.
Information: result of analysing data, i.e. extracting meaning from data to support evaluation and decision making.
Big Data: large amt of business data from many sources, much of which is available in real time, and uncertain or unpredictable. 4Vs: Volume, Variety (types), Velocity (real-time vs static data) & Veracity (certainty). Size is a moving target; it depends on, and is relative to, available tech capabilities.
Metric: a unit of measurement that provides a way to objectively quantify performance (sales, rev, profit).
Measurement: act of obtaining data associated with a metric (collecting data).
Measures: numbers associated with a metric ($50).
Reliability: accuracy and consistency. Validity: data correctly measures what it is supposed to measure.
Discrete Metric: one derived from counting something (e.g. whether a delivery is on time or not).
Continuous Metric: based on a continuous scale of measurement (e.g. $, time, vol).

Categorical: sorted into cats according to characteristics; no cat is better than another (e.g. M/F).
Ordinal: data is ordered according to some relationship to one another.
Interval: ordinal, but with constant differences between points; no meaning to 0 (e.g. date).
Ratio: continuous, with a natural 0 (e.g. cost).
Categorical (Nominal) Scale: property is identity. Example: gender.
Ordinal Scale: properties are identity and magnitude. Example: the results of a horse race / order number.
Interval Scale: properties are identity, magnitude, and equal distance. Example: survey rating (Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree) / date.
Ratio Scale: properties are identity, magnitude, equal distance, and absolute/true zero. Example: item cost/quantity/age.

BA uses data, IT, statistical analysis, quantitative methods, and math/computer-based models to help managers gain better insight about biz ops and make better fact-based decisions.
Benefits: lower cost, better risk management, faster decisions, productivity, π, satisfaction.
Challenges: lack of analytical skills & understanding of usage, difficult to get good data, data privacy, security & compliance, need right governance & org structure.
Descriptive Analytics: what happened; use of data to understand past and current biz performance to make informed decisions.
Predictive Analytics: what will happen; predict future by examining past data, detecting patterns/relationships in data, then extrapolating them forward in time.
Prescriptive Analytics: how can we make it happen; identify best alts to maximise or minimise some objectives (what optimisations you can do to increase sales).

Processes in Analytics Problem Solving:
1. Recognising a Problem (gap between what is and what we think should happen).
2. Problem Defining (complexity increases when no. of actions is big, problem belongs to group and not 1 person, competing objectives, external parties affected, problem solver != problem owner, time limit).
3. Problem Structuring (state goals & aims, characterise possible decisions, identify constraints/restrictions; model needs to be developed).
4. Analysing the Problem (involves experimentation, e.g. evaluating scenarios, analysing risks of alternatives, finding and getting a solution).
5. Interpreting Results & Decision Making (understand limits of models and their underlying assumptions; add judgement to make decisions).
6. Implementation (translating results of model back to reality, e.g. modifying policies).

Value: Financial (MktShare, π, etc); Non-Financial (health, QOL, satisfaction, etc).
Stakeholders: Internal (staff, mgt, biz owners); External (citizens, suppliers, shareholders, customers, biz partners).
BSC: Objectives; Strategy Maps; continuous improvement activities to implement for success. Measures: track org. performance. Targets: aims for each measure. Strategic Initiatives: projects that help us hit targets.
4 Perspectives: Financial or Stewardship (fin perf, effective resource use); Cust & Stakeholders (cust value, satisfaction/retention); Internal Process (efficiency, quality); Org Capacity / Learning & Growth (human K, culture, infras & tech).

Descriptive Anal: convert data to info to understand past & current perf; manipulate & summarise data; plotting on charts & data visualisation.
Statistics: summary measure of data. Descriptive statistics: methods that describe and summarise data.
Dashboards: visual representation of key biz measures; provides important summaries of biz info to help manage processes. Reasons: easy comparison of sales of different products; can identify trends, patterns, exceptions.
Pie-charts: display relative proportions. Precautions: 2-3 categories, numbers all add up to 100%, labels with group names + actual %; avoid 3D charts.
Bar-Charts: length ∝ value of variable. Clustered Bar-Chart +: comparing categorical/ordinal data, illustrating diff between value sets, showing proportions/percentages of a whole.
Line: good for TS. Precautions: value magnitudes should not differ greatly; else create separate charts.
Area Charts: +: more info than pie & line; -: can clutter observer's mind with too much detail if too many data series are used.
Scatter: r/s between 2 variables. Bubble: size of data marker corresponds to a value of 3rd variable; can plot 3 variables in 2D. Stock: stock prices. Surface: 3D data. Doughnut: like pie, but can have > 1 data series. Radar: multiple dimensions of a few data series. Geographical.

Frequency Table: table w/ no. of observations in non-overlapping groups; table the categories then barplot the counts.
Relative Frequency Dist: same as frequency table but proportions calculated instead.
Percentile: kth percentile rank $= \dfrac{n k}{100} + \dfrac{1}{2}$.
Contingency Table: tabular method displaying no. of observations in a dataset for different subcats of ≥2 categorical variables; subcats must be mutually exclusive and exhaustive (i.e. must constitute the entire data set).
Pivot Table: create cross tabulations to drill down a big dataset in many ways; slicers.
Population: all items of interest for a decision/investigation. Sample: subset of population; sampling is to obtain enough info to draw a valid inference about a population.

Location Measures: mean, median, mode.
Dispersion Measures: range; IQR (middle 50% of data); Var; S.D. (square root of Var). Var(X) for a sample, OR (N-1)/N * Var(X) for a population.
Chebyshev's Theorem: for any data set, proportion of values that lie within k stdevs of mean is at least $1 - 1/k^2$ (k > 1). -ve: usually underestimates.
Empirical Rule: for a normally distributed data set, the proportion of values that lie within k stdevs from mean is k=1: 68%, k=2: 95%, k=3: 99.7%.
Applications: Process Capability Index (Cp): measure of how well a manufacturing process can achieve specs; uses 3rd empirical rule; ≥1.5 is good.
$C_p = \dfrac{\text{upper specification limit} - \text{lower specification limit}}{(\text{mean} + 3\,\text{sd}) - (\text{mean} - 3\,\text{sd})}$
Standardised Value/Z-Score: tells us the relative distance of an observation from the mean in terms of standard deviations. $Z = \dfrac{\text{observation} - \text{mean}}{\text{sample sd}}$
Coefficient of Variation (CV): formula = SD/MEAN; relative measure of dispersion in data relative to the mean. Can be used to measure risk relative to return; 1/CV is return relative to risk, like the Sharpe Ratio.

Measures of Shape: Kurtosis/Skew.
Skew describes the lack of symmetry. $CS = \dfrac{1}{N\sigma^3}\sum(\text{obs} - \text{mean})^3$
|CS| > 1: high degree of skewness; 0.5 ≤ |CS| ≤ 1: moderate skewness; |CS| < 0.5: relative symmetry.
For +vely skewed (tail right): mode < median < mean. For -vely skewed (tail left): mean < median < mode.
Kurtosis describes peakedness/flatness of a histo. $CK = \dfrac{1}{N\sigma^4}\sum(\text{obs} - \text{mean})^4$
CK < 3: data is somewhat flat with wide degree of dispersion. CK > 3: data is somewhat peaked w/ lesser dispersion.

Association: Correlation/Covariance; measure of linear association between 2 variables, X & Y.
$cov(X, Y) = \dfrac{\sum(\text{obs}(x) - \text{mean}_x)(\text{obs}(y) - \text{mean}_y)}{N}$ for a population; divide by $N - 1$ for a sample.
Correlation: divide covariance by $s_1 s_2$ or $\sigma_1 \sigma_2$; not dependent on units; AKA Pearson (corr. coeff / moment corr).
Outliers: mean & range are sensitive to them. 1) |Z-Score| > 3; 2) Extreme: > 3 IQR to left of Q1 or right of Q3; Mild is between 1.5 and 3 IQR. How? Leave in data if important; else remove them if they are different from the rest / correct error in entry.
Statistical Thinking: philo of learning & action for improvement; based on 1) all work occurs in a system of interconnected processes; 2) variation exists in all of them; 3) performance improves from understanding and reducing this variation.

Hypothesis Testing (H0: existing theory; H1: the alternative): involves drawing inferences about two contrasting propositions (each called a hypothesis) relating to the value of one or more population parameters.
One Sample Hypothesis Tests; equality in H0!
H0: parameter ≤ constant; H1: parameter > constant.
H0: parameter ≥ constant; H1: parameter < constant.
H0: parameter = constant; H1: parameter != constant.
x̄, p̂ are sample mean & prop; µ0, π0 are hypothesised values; choose tail based on H1 inequality sign.
Alternatively: p-value; observed significance level. Reject H0 if p-value < α (for 2-tail tests, compare the one-tail p-value with α/2).
Hypo Testing Model Ans: p<0.05, hence we can reject H0 and conclude there is significant difference in mean satisfaction among the 3 educational levels. p>0.05, H0 cannot be rejected. Hence, we conclude that there is no significant difference in variances at 5% level of significance.

Type 1 Error: reject H0 when true (P=α). Type 2 Error: fail to reject H0 when false (P=β). α is set by us, β cannot be controlled directly. The further the true mean from the hypothesised mean, the smaller the value of β. General inverse r/s between α & β.
Power of test: 1-β, probability of not committing a type 2 error. Raised by increasing sample size, which allows detection of small differences between sample stats and pop parameters with more accuracy.
Test for Equality of Variance; F-TEST: Assumption: both samples are drawn from normal populations. H0: $\sigma_1^2 - \sigma_2^2 = 0$, H1: $\sigma_1^2 - \sigma_2^2 \ne 0$. Test statistic: $F = s_1^2 / s_2^2$; population with larger variance in numerator; associated with 2 degrees of freedom, 1 for pop1 and another for pop2.
ANOVA: compare means of ≥ 2 population grps. H0: means of all population grps are equal. Test statistic: F distribution. Assumptions: 1) independently & randomly obtained data (easily validated); 2) normally distributed (normality), although not super impt; 3) have equal variances (homogeneity). If sample sizes are equal, violation of the third assumption does not have serious effects, but with unequal sample sizes it can; use t-test instead for 2 pops! ANOVA is fairly robust to departures from normality (still works if not normal).

Probability is the likelihood an outcome occurs. Classical Def: prob can be deduced from theoretical arguments. Relative Freq Def: prob is based on empirical data (use freq distribution). Subjective Def: prob is based on judgement and experience.
Experiment is a process that results in an outcome. Outcome of an experiment is a result we observe. Sample Space is the collection of all possible experimental outcomes. Event is a collection of ≥1 outcomes from a sample space.
Random Variable (RV) is a numerical description of an experimental outcome. Discrete RV: number of possible outcomes can be counted. Continuous RV: outcomes over ≥1 continuous intervals of real numbers (changes over time).
Probability Distribution: characterises the possible values a RV can take along with the probability of getting them.
[Discrete] Prob Mass Function: specifies the probability of each value of random variable X; $x_i$ represents the ith value of X. Cumulative Distribution: $F(x) = P(X \le x)$.
[Continuous] Prob Density Function: 1. $f(x) \ge 0$ for all x; 2. total area under graph equals 1; 3. $P(X = x) = 0$; 4. probability defined only over an interval.
Normal Distribution: 2 parameters, σ & x̄; 1. symmetric; 2. mean = median = mode; 3. range of X is unbounded; 4. empirical rules apply.
Standard Normal Distribution: σ = 1, x̄ = 0, variable defined as Z.
Methods to verify fit goodness: Chi-sq (need at least 50 data pts); Kolmogorov-Smirnov (small samples); Anderson-Darling (greater weight on diff between distribution tails); Shapiro normality.

Estimates are measures used to estimate unknown pop parameters. Point Estimate is a single number from a sample used to estimate value of a pop parameter, e.g. s is a point estimate of σ. Unbiased if expected estimator value = pop parameter.
Sampling Error: depends on sample size relative to pop size; cannot be totally avoided, because sample is just a subset. Non-sampling Error: bad sample.
Sampling Distribution of the Mean (SDoM): distribution of means of all possible samples of size n. S.D. is the standard error of the mean, $\sigma/\sqrt{n}$, so standard error decreases with higher n.
CLT: 1. if sample size is large enough, SDoM is approx. normally distributed regardless of pop distribution, with mean equal to pop mean; 2. if pop distribution is normal, samp distribution is also normal for any n.
Interval Estimate: range for a population characteristic based on a sample; intervals specify a range of possible values a pop parameter can take. A 100(1-α)% probability interval is an interval [A, B] such that the probability of falling between A & B is 1-α.
CI (always check for normality; if sample size is large use CLT, if not transform and transform back): a range that the pop parameter is believed to be in, along with a prob that the interval correctly estimates the true unknown pop parameter.
For mean w/ known pop sd: $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$, where $z_{\alpha/2}$ is the critical value for an upper tail area of α/2.
For mean w/ unknown pop sd: T-distribution; df = sample values - est parameters; $\bar{x} \pm t_{\alpha/2,\,n-1}\,s/\sqrt{n}$; as n gets larger, $t_{\alpha/2,\,n-1}$ gets close to $z_{\alpha/2}$.
For proportion: $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$, where $\hat{p}$ is samp proportion.
For prediction interval (range for predicting a new observation from the same population, associated with distribution of random variable): $\bar{x} \pm t_{\alpha/2,\,n-1}\,s\sqrt{1 + 1/n}$.
To approximate sample size needed to keep errors within a margin E (specified level of precision): for a mean, $n \ge (z_{\alpha/2}\,\sigma/E)^2$; for a proportion, $n \ge z_{\alpha/2}^2\,\hat{p}(1-\hat{p})/E^2$ [can either use sample prop or set it to 0.5 to guarantee precision].
Confidence interval and Prediction interval Model Ans: The 95% in the phrase '95% confidence interval' means that about 95% of the researchers will have intervals that contain the population parameter. Based on the confidence interval, it tells us that there is 95% probability that the mean median SAT for liberal arts would fall between the range of [1238.61, 1274.67] (GER). Based on the prediction interval, it tells us that there is 99% probability that the acceptance rate of a new school (a new observation in the population) would fall between the range of [1.87%, 74.3%].
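The sample-size rules for a mean and for a proportion can be turned into a short calculation. This is a sketch under the usual z-based approximation; the function names and the example inputs (sd of 15, margins of 3 and 0.05) are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist

def z_critical(confidence=0.95):
    """Critical value z_{alpha/2} for a two-sided interval."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def sample_size_mean(sd, margin, confidence=0.95):
    """Smallest n with z * sd / sqrt(n) <= margin."""
    return ceil((z_critical(confidence) * sd / margin) ** 2)

def sample_size_proportion(margin, p=0.5, confidence=0.95):
    """Smallest n with z * sqrt(p(1-p)/n) <= margin; p = 0.5 is the worst case."""
    return ceil(z_critical(confidence) ** 2 * p * (1 - p) / margin ** 2)

n_mean = sample_size_mean(sd=15, margin=3)
n_prop = sample_size_proportion(margin=0.05)
print(n_mean, n_prop)
```

Rounding up with `ceil` keeps the margin of error at or below the target. Using p = 0.5 gives the largest required sample, so the result is safe when no prior estimate of the proportion is available.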

Regression Analysis is a tool that builds math & stats models that characterise r/s between a dependent (ratio) variable and 1 or more independent/explanatory variables (ratio/categorical), which are numerical. Single: 1 independent var. Mult: ≥ 2 IVs.
Logit regression (categorical) vs Linear (continuous).
Residuals: observed - estimate. Best-fit line minimises sum of squares of residuals.
Standard Residual: residual/sd; values outside ±2/±3 are possibly outliers.
Standard Error: variability between observed and predicted Y values; aka standard error of estimate.
Residual Standard Error is the square root of (residual sum of squares divided by the residual degrees of freedom). Generally the lower, the better. Interpreting how big or small the residual SE is requires careful attention to the units, e.g. if Y values are in the 1000s, residual SE of 10 may be very good.
Multiple R^2 ("coefficient of determination"): measures the proportion of the variance explained by the model, i.e. the model explains about X of the data (0.6 --> 60%). Adding more variables into the model (in a multiple regression) never decreases R^2. When R^2 is high (e.g. 0.869) but none of the predictors are significant (p>0.05), the 2 variables are highly corr. Adjusted R^2 contains a penalty for the number of terms in the model.
F statistic: H0: all the b's are zero; H1: at least one of the b's is non-zero. If the F-statistic is very large (>10) and the p-value is very small (< .05), we can reject the null hypothesis that the model has no predictive power (model is not weak, at least one coefficient b is non-zero; we want F-stat to be big).
Linear Regression Assumptions: 1) Linearity (scatter should be linear; residual plot should be random); 2) Normality of errors (histogram of standard residuals; but not super impt); 3) Homoscedasticity (residuals should be random); 4) Independence of errors; 5) No outliers; 6) No multicollinearity.
Heteroscedasticity: error depends on IV (e.g. errors are larger for smaller x); sometimes this implies that your model is missing some variables.
Multicollinearity (MC) occurs when two or more regressors are "collinear" with each other, that is, they are very highly correlated with each other (rule of thumb: r > 0.7). A large proportion of shared variance means that it is difficult to estimate the errors, which leads to inflated estimates of errors (and consequently unstable models, high errors, and high p-values), and generally weird findings that do not make sense. "(1 not defined because of singularities)" = perfect MC. Use variance inflation factor to diagnose this.
LM T test, interpreting the coefficients: t-statistic for a one-sample t-test comparing coefficient to 0; we look at whether the value of the coefficient is zero or not, as it tells us how much it affects the y value.
Model building: Step 1) Write down your hypotheses. Select the IV and DV. 2) Check data, relationships, and assumptions.

Qualitative & Judgemental Forecasting relies on experience and intuition; needed when historical data is not available / predictions are needed into far future.
Historical Analogy: look at past & forecast by comparing w/ prior situations [but cannot always account for the current reality!].
Indicators are measures believed to influence a var we forecast; usually combined quantitatively into an index, 1 measure that weights multiple indicators; measure of overall expectation. Leading Indicator: measures that change before a variable changes. Lagging Indicator: measures that change after.
Time Series: stream of historical data; components: random behaviour, trends, seasonal effect, cyclical effect. Stationary TS: only random behaviour. Cyclical vs Seasonal: C shows ups and downs over a longer T frame; S repeats @ fixed intervals of T.
Trends: a gradual upward or downward movement of a time series over time. Seasonal effects: an effect that occurs/repeats at a fixed interval. Cyclical effects: longer-term effects that do not have a fixed interval/length. Stationarity: when statistical properties of the time series (e.g. mean, variance) do not change over time.
Stationary TS [only]:
1) Simple Moving Average: averages out random fluctuations in TS to identify direction of TS. Higher k gives smoother forecasts because extreme values have lesser impact. -: can only forecast next value.
2) Exponential Smoothing / Single Exponential Smoothing: Forecast = α * Units in previous week + (1-α) * Forecast for previous week. α is called the smoothing factor/coeff/constant; 1-α is the damping factor. This model gives more weight to more recent values, and it takes into account all past data.
EXP: $\hat{Y}_{t+1} = \alpha\,(Y_t + (1-\alpha)Y_{t-1} + (1-\alpha)^2 Y_{t-2} + \dots)$
Mean Absolute Deviation (MAD) $= \frac{1}{n}\sum |A_t - P_t|$
Residual Sum of Squares (RSS) $= \sum (A_t - P_t)^2$
Mean Squared Error (MSE) $= \frac{1}{n}\sum (A_t - P_t)^2$
Root Mean Squared Error $= \left(\frac{1}{n}\sum (A_t - P_t)^2\right)^{0.5}$
Residual Standard Error $= \left(\frac{1}{df}\sum (A_t - P_t)^2\right)^{0.5}$
Autocorrelation: when a variable is correlated with itself over time. E.g. an autoregressive model of order 1 is when you regress a dependent variable against a past version of itself: $Sales_t = b_0 + b_1\,Sales_{t-1}$
No T, no S: simple moving average or simple exponential smoothing: HoltWinters(x, beta = FALSE, gamma = FALSE)
T, no S: Double Exponential; first eqn adjusts lx directly for trend of previous period; eliminates lag and brings lx to the appropriate base of the current value: HoltWinters(x, gamma = FALSE)
S, no T: Multiple Regression Model w/ categorical variables for the seasonal components (k-1 dummies).
ALL: Holt-Winters Triple Exponential; smoothing is applied across seasons. Season Length is the no. of data points after which a new season begins. Seasonal Component: 1 for every point in the season: HoltWinters(x)
Variance Inflation Factor (VIF) It provides an index that

Data Mining: Simpson's Paradox, where the trend between two variables (X and Y) observed when combining subgroups reverses sign from the trend within individual subgroups. Hidden confounders, e.g. unequal distribution between groups / difference in collection of data.
Feature Selection: use PCA (Principal Component Analysis). PC1 is the feature that explains the most variance. Each PC is orthogonal to the others. Important to have "scale=T, centre = T"; this will standardise (normalise) all your variables. PCA1$rotation: the weight of each variable. Will not have any MC in PCA.
High Recall: maximise true positives (catch all cancer cases). High Precision: minimise false positives (minimise misclassifying healthy indiv).
Overfitting: occurs when a model is too complicated, such that it predicts a specific dataset very well but does not generalise beyond it, e.g. when the number of features in your model is large (and close to) the number of data points you have, or when choosing k = N clusters. We can combat this by cross-validating on held-out data. Use the Training Set to fit the model, use the Validation Set to pick model hyperparameters (e.g. choose k in clustering), use the Test Set to report final prediction accuracy. Over-fitted: high performance on Training Set + low performance on Validation Set. Optimising model to give best accuracy on Training Set may still result in overfitting. That is why we recommend optimising model on the Validation Set.
- If independent and identically distributed dataset, then a random split may be recommended (2 coin tosses).
- If you have a time-series dataset, you may want to hold out the most recent years (to evaluate forecasting).
Regularization: you add a penalty for each term in your model into the loss function that is minimised during fitting. It requires standardised variables. [Link]() returns coefficients at the lambda that gives the minimum Mean Sq Error.
Monte Carlo: Monte Carlo Simulation involves generating random values for uncertain inputs, computing the output variables, and repeating this for many trials to understand the distribution of the output. This allows us to estimate the probability of certain desired or undesired outcomes. Used when an experiment is expensive/impossible, when results are difficult to calculate analytically, or when data does not meet assumptions of parametric statistics.
Advantage of using empirical data: future data might be similar to the past data, if past data is large. Disadvantages: 1) if existing data is small / has high sampling error, sampling might exacerbate the error; 2) higher chance of obtaining outliers in sample; 3) data cannot be used to predict out of data's range. If there is existing data, we can fit it to a dist. Adv: might not be as affected by outliers, and can sample from data 'in between' that might not exist in the existing empirical data. Set seed; sampling error changes every time we sample.
Linear Optimization: optimise an objective function (e.g. minimise time spent / risk exposure). Feasible Region: contains solutions that satisfy all the constraints; it is the intersection of all the constraint regions. Corner Points: any change in decision variable will result in the worsening of the objective fn. Multiple sol: set of optimal solutions (i.e. the level set of the optimal solution) is parallel to / lies along a constraint. Unbounded - When u can increase X indefinitely. No

K-Means Clustering: used when we don't know the labels and want to find out through analysis. Assign each data point to the cluster that minimises the within-cluster squared error. Figure out the ideal k value using the "elbow" method (does not always give the ideal k, as sometimes it may be hard to interpret the clusters). [Link] -> plots PC2 vs PC1 (e.g. taking the graph from lecture with 2 clusters: the clusters are separated left and right, i.e. along PC1, but do not differ up and down, i.e. along PC2).
Classification: when already given labels on some of the data and we want to learn about new or unlabelled data (e.g. we know who the highest spenders are and we want to figure out their demographics --> finding which label (demographic) contributes the most to the spending). Uses Classification matrix -> True or False = is the model right or wrong; (+ve) or (-ve) = what the model predicted.
Difference: PCA (have a lot of features and want to summarise but not worry about MC). Clustering (like PCA but groups the people together). Classification: the inverse of both (finds and links labels to data).

1. Compare two sample means with normal errors (CLT applies for samples)
   a. Two independent samples
      i. t.test(y~x) # y is numeric, x is factor
      ii. t.test(y1, y2) # y1, y2 are numeric
      iii. t.test(y~x, var.equal=TRUE) # equal variance
   b. Two matched (paired) samples (when data are naturally paired)
      i. t.test(y1, y2, paired=TRUE) (e.g. estimated vol vs actual vol)
2. Compare equality of means of two or more sample groups
   a. one-way ANOVA: aov(y1 ~ y2) # ANOVA assumptions met
      i. H0: means are the same, H1: at least one is different
      ii. Post-hoc pairwise comparisons: TukeyHSD: if p value < 0.05, means are significantly different
   b. Kruskal-Wallis rank sum test # ANOVA assumptions not met
      i. H0: means are the same, H1: at least one is different
      ii. Post-hoc pairwise comparisons: pairwise Wilcox test: if p value < 0.05, means are significantly different
3. Compare two variances (Fisher's F test, var.test)
   a. var.test(y~x) (both data sets must be normal; only takes in 2 data sets)
4. Compare more than two variances
   a. bartlett.test(y~x) (normal data; can take more than 2 data sets)
      i. H0: variance is the same, H1: at least one is different
      ii. if p value < 0.05, we can reject H0 and conclude that the variance across groups is different
   b. Levene's test (less sensitive to departures from normality)
      i. H0: variance is the same, H1: at least one is different
      ii. if p value < 0.05, we can reject H0 and conclude that the variance across groups is different
   c. Fligner-Killeen test (most robust to departures from normal data)
      i. H0: variance is the same, H1: at least one is different
      ii. if p value < 0.05, we can reject H0 and conclude that the variance across groups is different

Continuous IVs and categorical IVs combined for a continuous Y:
UmbrellaSales = 10(Sunny) + 50Rainy + 20Cloudy + 2.5ExpenditureOnAdvertising
b0: average umbrella sales when Sunny (reference grp) and 0 spend on advert
b0 + b1: average umbrella sales when Rainy and 0 spend on advert
b1: average umbrella sales when Rainy compared to Sunny (reference group) and 0 spend on advertising (sales when Rainy - sales when Sunny)
b3: when Sunny (reference grp), every $ spent on advt increases sales by 2.5
Regression based forecasting with interaction term (ex_freq ~ age * gender)
Check correlations for multicollinearity. Check distribution of measures how much the variance (the square of the estimate's Feasible solution - no overlap in the different feasable regions. Exercise Frequency = 5.8 - 0.04 Age + 3.3 genderMale - 0.11 age:genderMale
variables, amount of missing data. 3) use an analysis plan standard deviation) of an estimated regression coefficient is Sensitivity Analysis (vary coeff of obj function / [Link]/to) -
(stepwise regression, or a series of ANOVAs). 4)Evaluate and Range of objective function's coeff for which the optimal solution b0: A hypothetical Female at 0 years old would exercise 5.8hrs a week. This is
Interpret your model - Correlation != Causation. All things increased because of collinearity. (Rule of thumb is >5) - remains the same (all other coeff constant) Eg. The price we can sell significantly different from 0 at the 0.05 (or 0.01 or 0.001) level of significance.
being equal, simpler models are usually better. means there is strong multicollinearity cocktail glasses (coefficient on X3) lies between -Infinity and 341 --> in
How to solve? Either choose one, or combine them (e.g., take (Age): Every year of age reduces exercise frequency by -0.042 hours per week,
the average) order for the cocktail glasses to be profitable, we would need to for women (reference group). p=0.0003<.001, this is significantly different...
Good Model; logical, simple. Even if a variable is stat increase the selling price to at least 341 per case of cocktail glasses.
insignificant, may keep because it is due to sampling Shadow Price(Duals) - the change in the objective function (holding all (GenderMale): Males (at 0 yrs) exercise 3.34 hours more than Females (at 0 yrs).
Backwards Stepwise Regression: Start with a model with all else equal) value per unit-increase in the RHS value of that constraint.
error Overfitting: model too closely to sample data,
variables, then at each step, eliminate the "weakest" predictor (Interaction term): The decrease in exercise frequency with age for males is
cannot fit to pop; for mult regression, due to Increasing the space constraint by 1 unit (200 to 201) increases the MORE than for females; males reduce their exercise frequency by -0.11 hours/
adding too many variables Interaction occurs when 1 one-by-one until everything left is "good". Forward Stepwise profit by $0.05 (from 60 to 60.05). The Shadow Price of the second
variable is dependent on another, test by defining a Regression: Similar idea, but "forwards". Start with a model constraint is 0.05. week MORE per year that they are older. p=3.37e−13<.001, so this is significantly
new variable and testing for significance X3=X1*X2; with no predictors, then at every step, add the "best" Binding constraints- shadow price != 0, any changes in the RHS different from 0 at the 0.05 (or 0.01 or 0.001) level of significance.
Correl != interaction! It just means the effect of predictor one-by-one until everything not added is "bad" constraints will result in a change in the optimal solution. LOGIT: logit(p) ≡ b0 + b1X1 + b2X2
one of these variables on a 3rd variable differs in extent Negative shadow price - (binding) a unit increase in the RHS will b0: Log-odds when x1 and x2 = 0 . Odds of purchasing = exp(-4.0054) = 0.018
decrease the objective function
Shapiro Test Model Ans: P-value > 0.05 implies that the distribution of the data is not significantly different from normal Recommendation Memo: 1) Identify Objective Fn,Decision Var b1: Expected increase in log-odds of event per unit-increase of x1, holding x2
distribution. In other words, we can assume data is normal. Since p-value is < 0.05, we have sufficient evidence that “Number of 2)Identify Constraints 3)Write optimisation model 4) Solve graphically constant. Every unit-increase of x1 increases the log-odds by 0.3954 it multiplies
Children” is not normally distributed and can reject the null hypothesis that it is normally distributed. and/or manually 5) Use R to solve 6) Conduct sensitivity analysis 7) the odds by exp(0.3954) = 1.48. i.e., the odds increase by 48%. (not increasing
T test Model Ans: t-value = -0.3476 < Tcritical of 1.6711 Upper tail p-value = 0.6353 > 0.05 Hence, there is insufficient evidence at interpret results and write recommendation. probability, but odds)
the 5% level of significance to reject the null hypothesis and accept the claim made by the industry trade publication. Standardized coefficients:
Examine the range of coefficients over which this solution is valid. Integer Optimisation - Feasible solutions are now integer-valued
solutions that lie within the feasible region bounded by the other When Xi, increases by one sd,there is a change in Y of βi sd
(e.g., "If the profit per unit-Kale goes above $0.53 or if the profit per unit-Parsnips goes below $0.11, then please switch your constraints [Link] many shifts to assign an employee. But the nearest ANOVA MODEL SELECTION Model Ans
whole production to Kale.") Examine and interpret shadow prices (e.g. "If you get one additional plot of land, you can make feasible answer is not guaranteed to be optimal! The optimal objective To use an ANOVA, the restricted model (Y ~X1) must be a nested model within
$0.05 more profit. But if you get $1 more of budget to buy crops, you can increase your profit by $0.50 function value of the LP-relaxation is always better than (or equal to)
than the optimal objective function value of the integer-constrained the full model(Y ~ X1 + X2). It must be a "subset" of the full model. F-statistic is
How to compare nested models in R? 1)Fit each Model (1) and Model (2) -> lm(model1) / lm(model2) 2) Anova(), use F-statistic very small (<10), and the p-value is very large. We cannot reject the null
or p-value to test statistically different 3)If the test cannot reject H0 choose (2) ; otherwise stay with (1) OR If you have a large solution. The optimal integer solution may also change which variables
are Lee
relevant! Code - binary/[Link] = c(1:24) hypothesis (that b2 = 0). The ANOVA suggests that the full model ('model 2') is
set of variables to choose from, another approach is stepwise regression, which comes in two flavours: backwards and forwards
Downloaded by Good (goodlee20@[Link])
not significantly better than the restricted model.-> Pick the simpler model.
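As a rough illustration of the nested-model comparison described above, the sketch below fits a restricted and a full model and compares them with base R's anova(). The data are simulated purely for illustration (the variable names x1, x2, y and the coefficients are assumptions, not from the course material), with x2 deliberately built as noise.

```r
# Hypothetical simulated data: y depends on x1 only, x2 is pure noise
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 + rnorm(n)

restricted <- lm(y ~ x1)        # restricted model, nested inside the full model
full       <- lm(y ~ x1 + x2)   # full model adds x2

# F-test of H0: the coefficient on x2 is 0
cmp <- anova(restricted, full)
print(cmp)

# p-value of the comparison (second row of the anova table)
p_value <- cmp[["Pr(>F)"]][2]
print(p_value)
```

If the p-value is large, the extra predictor does not significantly improve the fit and the simpler model is preferred; a small p-value would favour the full model.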

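The three HoltWinters() calls listed in the time-series notes can be tried on R's built-in datasets. This is only a sketch: uspop and co2 stand in for the generic series x (an assumption for illustration), and the 12-step forecast horizon is arbitrary.

```r
# Built-in series used as stand-ins for the cheatsheet's generic x
# No trend, no seasonality: simple exponential smoothing
simple_fit <- HoltWinters(uspop, beta = FALSE, gamma = FALSE)

# Trend, no seasonality: double exponential smoothing
double_fit <- HoltWinters(uspop, gamma = FALSE)

# Trend and seasonality: triple exponential (needs a seasonal ts such as co2)
triple_fit <- HoltWinters(co2)

# Forecast the next 12 observations from the seasonal model
forecast_12 <- predict(triple_fit, n.ahead = 12)
print(forecast_12)
```

Setting beta = FALSE switches off the trend component and gamma = FALSE switches off the seasonal component, which is how the same function covers all three cases.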
Common questions

Multicollinearity inflates the variance of regression coefficients, making them unstable and potentially leading to incorrect inferences about the relationship between variables. It increases the difficulty of determining the impact of individual predictors. Solutions include removing or combining collinear variables, using techniques like Principal Component Analysis (PCA) to reduce dimensionality, or applying penalization methods like Ridge Regression that can handle multicollinearity by including a penalty term in optimization. These methods improve model reliability and interpretability by mitigating issues due to high intercorrelation among input variables.
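One common diagnostic for this is the variance inflation factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. The sketch below computes it by hand in base R on simulated data; the helper vif_manual and the variables x1, x2, x3 are hypothetical names introduced for illustration.

```r
# Hypothetical simulated predictors: x2 is nearly a copy of x1, x3 is unrelated
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the other predictors
vif_manual <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vifs <- c(
  x1 = vif_manual(x1, data.frame(x2, x3)),
  x2 = vif_manual(x2, data.frame(x1, x3)),
  x3 = vif_manual(x3, data.frame(x1, x2))
)
print(round(vifs, 2))
```

Under the rule of thumb quoted in these notes (VIF > 5), x1 and x2 would be flagged as collinear while x3 would not; dropping one of the pair or averaging them are the remedies mentioned above.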

Skewness measures the asymmetry of a data distribution. A positively skewed distribution has a long right tail, where mode < median < mean, indicating more frequent lower values. A negatively skewed distribution has a long left tail, where mean < median < mode. High skewness can distort interpretations of the mean and may necessitate data transformation before analysis. Kurtosis measures how heavy the tails of a distribution are relative to a normal distribution, and is often described loosely as peakedness. A high kurtosis (CK > 3) indicates a sharper peak with heavier tails and more extreme values, while low kurtosis (CK < 3) indicates a flatter peak with lighter tails. Both statistics help identify the distributional shape, influencing inferential statistics and data transformations.
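To make these definitions concrete, the sketch below computes moment-based skewness and kurtosis in base R. The helper names skewness and kurtosis and the simulated samples are assumptions for illustration; packages such as e1071 or moments offer similar functions with slightly different small-sample corrections.

```r
# Moment-based definitions (no small-sample correction)
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2   # equals 3 for a normal distribution
}

set.seed(7)
right_skewed <- rexp(10000)    # long right tail
symmetric    <- rnorm(10000)   # roughly normal

print(c(skew = skewness(right_skewed), kurt = kurtosis(right_skewed)))
print(c(skew = skewness(symmetric),    kurt = kurtosis(symmetric)))
print(c(mean = mean(right_skewed), median = median(right_skewed)))
```

The exponential sample should show positive skewness with its mean above its median, while the normal sample should have skewness near 0 and kurtosis near 3.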