Alteryx Inspire Conference
Field Summary tool: used to investigate data types & statistical distributions
Scatter plots & plot of means can be used for exploratory data analysis
Impute tool (handles missing or zero values) with mean as an option
Are 0 values included in the mean calculation?
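The answer to the question above depends on whether zeros are treated as real observations or as missing. A minimal Python sketch of the two behaviours, not the Impute tool's actual implementation; `impute_mean` and `treat_zero_as_missing` are hypothetical names used only for illustration:

```python
from statistics import mean

def impute_mean(values, treat_zero_as_missing=False):
    """Replace missing entries (None, and optionally 0) with the mean of the rest."""
    def is_missing(v):
        return v is None or (treat_zero_as_missing and v == 0)

    observed = [v for v in values if not is_missing(v)]
    fill = mean(observed)  # mean of the non-missing values only
    return [fill if is_missing(v) else v for v in values]

data = [10, 0, None, 20]
zeros_kept = impute_mean(data)                                 # zeros count towards the mean
zeros_missing = impute_mean(data, treat_zero_as_missing=True)  # zeros are replaced too
```

When zeros are kept, they pull the mean down (10 here); when they are treated as missing, the fill value is the mean of the remaining values (15 here).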
P-value analysis on target variable (the lower the value, the more significant the result)
Association measure (analysis only relevant for linear/logistic regression)
Create samples tool: creates a training/testing set
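What a training/testing split does can be sketched in plain Python. This is an illustration under assumptions (a seeded shuffle and an 80/20 split), not how the Create Samples tool is implemented; `train_test_split` here is a hypothetical helper:

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(range(100), test_fraction=0.2)
```

Every record lands in exactly one of the two sets, so the model is scored on data it has not seen during training.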
Linear regression (interactive tool provides a breakdown of results). Look especially for the lowest
p-values, which indicate the most statistically significant predictors
Intercept value (predicted value when every predictor variable is zero)
OLS analysis (the spread of the errors, i.e. residuals, will reveal model bias)
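The intercept and the error spread can be illustrated with a one-predictor OLS fit. A minimal sketch with made-up numbers; `fit_ols` is a hypothetical helper, not an Alteryx function:

```python
from statistics import mean

def fit_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar   # predicted y when x is zero
    return intercept, slope

# Illustrative data only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
intercept, slope = fit_ols(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
```

With an intercept in the model the residuals sum to zero, so the thing to inspect is their pattern: residuals that trend or fan out as x grows suggest the model is missing something.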
Stepwise regression (re-selects predictor variables depending on their significance)
Oversample tool (selects samples biased to a certain value)
Log normalisation (dealing with skewed data)
Log([value]+1); regression handles linearised data more easily
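The same transform in Python, as a small sketch. `math.log1p(v)` computes log(v + 1) and is defined at zero, which is why the +1 is added; the sample values are illustrative:

```python
import math

def log_transform(values):
    """Apply log(value + 1) to non-negative values."""
    return [math.log1p(v) for v in values]

skewed = [0, 9, 99, 999]
transformed = log_transform(skewed)
restored = [math.expm1(t) for t in transformed]   # inverse: exp(t) - 1
```

Values spanning three orders of magnitude are compressed to a range of roughly 0 to 7, and `math.expm1` converts predictions back to the original scale.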
Confusion matrix will give values of false positives/negatives
Based on the false positives, we can oversample to a 50/50 split to train the model
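A sketch of both ideas in plain Python: counting the confusion-matrix cells, then rebalancing the training rows to a 50% split. Reaching the split by downsampling the larger class is an assumption made for this example, not confirmed behaviour of the Oversample tool; `confusion_counts` and `balance_to_half` are hypothetical helpers:

```python
import random

def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary target."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif a == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1
    return tp, fp, fn, tn

def balance_to_half(rows, label_of, positive=1, seed=0):
    """Downsample the larger class so positives make up 50% of the result."""
    pos = [r for r in rows if label_of(r) == positive]
    neg = [r for r in rows if label_of(r) != positive]
    n = min(len(pos), len(neg))
    rng = random.Random(seed)
    return rng.sample(pos, n) + rng.sample(neg, n)

# Illustrative labels only
actual    = [1, 0, 0, 1, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 0, 1, 1]
tp, fp, fn, tn = confusion_counts(actual, predicted)

rows = list(zip(range(8), actual))            # (id, label) pairs
balanced = balance_to_half(rows, label_of=lambda r: r[1])
```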
Decision Tree (Tree Classification browse tool): green is the path to failure, orange represents
success; at each split, go left if the answer is yes, otherwise go right
Accuracy at each node can be shown
Union tool can also combine model objects together
Understanding Time Series
Always start with a field summary (describe())
Find any missing periods
MUST have consecutive periods between beginning and ending periods
TS Filler fills gaps (missing periods) in the series
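Finding and filling missing periods can be sketched in plain Python for a daily series. This only illustrates the idea and is not the TS Filler implementation; `fill_daily_gaps` and the sample data are hypothetical:

```python
from datetime import date, timedelta

def fill_daily_gaps(series):
    """Return (day, value) pairs covering every day; gaps get the value None."""
    days = sorted(series)
    filled = []
    current = days[0]
    while current <= days[-1]:
        filled.append((current, series.get(current)))
        current += timedelta(days=1)
    return filled

sales = {date(2024, 1, 1): 5, date(2024, 1, 2): 7, date(2024, 1, 5): 4}
filled = fill_daily_gaps(sales)
missing = [day for day, value in filled if value is None]
```

The filled series has consecutive periods from start to end; the `None` values still need to be imputed before modelling.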
Green bar represents population of numeric vs. null values
The TS Plot tool lets you analyse time series data in terms of decomposition, autocorrelation and
partial autocorrelation
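Autocorrelation measures how strongly a series correlates with a lagged copy of itself. A minimal sketch of the sample autocorrelation on a toy series; `autocorrelation` is a hypothetical helper and the data are illustrative:

```python
from statistics import mean

def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag (lag >= 0)."""
    m = mean(series)
    denom = sum((v - m) ** 2 for v in series)
    num = sum((series[t] - m) * (series[t - lag] - m)
              for t in range(lag, len(series)))
    return num / denom

series = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]   # repeats every 4 periods
acf = [autocorrelation(series, k) for k in range(6)]
```

The value at lag 4 is strongly positive while lag 2 is negative, which is the kind of spike that points to a seasonal period of 4.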
Log frequency/sample to look at relative basis over time
Clustering is an unsupervised learning technique
Udacity (predictive analytics course). Can do
Cache & run workflow (caches results up to a certain point in a workflow)
Insights tool – has a built-in viz platform
Putler’s Predictive Analytics Pyramid
Determine information needed to address problem/issue
Find & engineer appropriate and meaningful predictors
Relationship between predictors & target
Determine type of models needed
Meaningful metrics for prediction
Decision makers can tend to jump to a solution too soon rather than determining what information
is really needed to inform the problem/solution.
Comparing metrics from different types of models
Is a predictor providing signal or creating noise in the model?
Which predictor matters the most when making a prediction?
Different modelling methods use different measures of effect size
How does the predicted value change as the level of a numeric predictor increases, or as the category
changes for a categorical predictor?
For classification models – predicted probability for each possible target class
Regression models (predicted numeric value of target)
Metrics - Regression
1. MAPE (%)
2. RMSE
3. Correlation between actual & predicted values
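The three regression metrics above can be computed as follows. A minimal sketch using the standard definitions; the function names and sample values are illustrative:

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error, in the units of the target."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def pearson(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    va = sum((a - ma) ** 2 for a in actual)
    vp = sum((p - mp) ** 2 for p in predicted)
    return cov / math.sqrt(va * vp)

actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
scores = (mape(actual, predicted), rmse(actual, predicted), pearson(actual, predicted))
```

MAPE is scale-free, which makes it easy to compare across targets, while RMSE is in the target's own units and penalises large errors more heavily.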
Metrics - Binary or Multi-Class Models
- Area under the receiver operating characteristic curve (AUC): defined for binary targets, with
multi-class extensions available
- Confusion matrix
- Log-loss (penalises based on the predicted probabilities, so confident wrong predictions cost the most)
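AUC and log-loss measure different things: AUC depends only on how records are ranked, while log-loss depends on the predicted probabilities themselves. A minimal sketch with standard definitions and made-up scores; the function names are illustrative:

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood for binary labels (1 = positive)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)          # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def auc(labels, probs):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0]
good_probs = [0.9, 0.1, 0.8, 0.2]
bad_probs = [0.9, 0.1, 0.8, 0.95]   # last record: confident and wrong

good = (auc(labels, good_probs), log_loss(labels, good_probs))
bad = (auc(labels, bad_probs), log_loss(labels, bad_probs))
```

A single confident mistake drops AUC from 1.0 to 0.5 in this toy example and raises the log-loss sharply.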
Partial dependency plot (fitted values across range of a focal predictor)
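A partial dependence curve can be sketched by forcing the focal predictor to each value on a grid and averaging the model's predictions over all records. A minimal illustration; `partial_dependence`, `toy_model` and the column names are hypothetical stand-ins for a fitted model and its data:

```python
def partial_dependence(model, rows, focal, grid):
    """Average prediction over all rows with the focal predictor set to each grid value."""
    curve = []
    for value in grid:
        preds = [model({**row, focal: value}) for row in rows]
        curve.append((value, sum(preds) / len(preds)))
    return curve

# Hypothetical fitted model: any callable mapping a row dict to a prediction.
def toy_model(row):
    return 2.0 * row["price"] + 0.5 * row["age"]

rows = [{"price": 1.0, "age": 10}, {"price": 3.0, "age": 30}]
curve = partial_dependence(toy_model, rows, focal="price", grid=[0.0, 1.0, 2.0])
```

Plotting the second element of each pair against the first gives the fitted values across the range of the focal predictor, with the other predictors averaged out.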
Multi-collinearity only starts affecting the model when the number of records is large
Reverse-causality
Efficiency
Performance
Memory
Hard drive space
Load on servers during production
Development Efficiency
Caching
Right-click & cache to avoid re-running workflow
Reduce by sampling
Ctrl+f (in all caps, can search for values within tools)
Can load games (in ‘about’ section)
HIPPO (Highest Paid Person’s Opinion)