1
Getting Started with Excel and Excel Add-Ins
(Appendix B)
Microsoft Excel is arguably the most widely used computer application among business
professionals. Accountants, economists, financial analysts, marketers, HR managers, and many
others use Excel spreadsheets for everyday business tasks. Oftentimes, these tasks involve
entering, editing, and formatting data as well as performing data analysis.
Excel Add-Ins
In this course, we use the Analysis ToolPak add-in for statistical analysis. This add-in comes
preinstalled with Excel and can easily be activated. It needs to be activated only once.
Activating Analysis ToolPak Add-Ins
a. For MS Windows, In Excel, go to File > Options > Add-Ins. For MacOS, go to Tools >
Excel Add-ins and continue to step c.
b. In the Manage Excel Add-ins section (toward the bottom of the screen), click Go…
c. On the Add-ins dialog box, check the Analysis ToolPak Add-In box and click OK.
d. In Excel, go to the Data tab and verify that the Data Analysis features appear.
Formulas
In Excel, we use formulas to perform basic calculations. When we enter a formula in a cell,
Excel performs the specified calculation and returns the result in the same cell. We also use
formulas to manipulate the cell content such as rounding a number.
Relative, Absolute, and Mixed References
By default, cell addresses in a formula, such as = B1+C1, are relative references and will change
when a formula is copied to another cell. For example, if we enter the formula =B1+C1, in cell
A1 and copy it to cell D4, the formula in cell D4 will appear as =E4+F4.
Absolute references allow us to maintain the original cell references when a formula is copied
elsewhere. We specify absolute references by adding a dollar sign ($) in front of the column
name and row number (e.g., $B$1).
We use mixed references by adding a dollar sign ($) in front of either the column name or the
row number (e.g., $B1 or B$1), but not both. This will keep the reference to the specific column
name or the row number constant.
2
Table: Relative, Absolute, and Mixed References
Reference Formula in cell A1 Formula in cell A1 copied to cell D4
Relative =B1+C1 =E4+F4
Absolute =$B$1+$C$1 =$B$1+$C$1
Mixed =$B1+$C1 =$B4+$C4
Mixed =B$1+C$1 =E$1+F$1
• Illustrated reference example next.
Functions
Functions in Excel are predefined formulas. Like a formula, a function always begins with an
equal sign (=) and must be written with the correct syntax enclosed within parentheses.
Table: Basic Descriptive Functions in Excel
Function and syntax Description Example
Returns the number of cells in the array
=COUNT(array) =COUNT(A1:A10)
with numerical values.
Returns the number of cells in the array
=COUNTA(array) =COUNTA(A1:A10)
that are not blank.
Returns the number of cells in the array
=COUNTBLANK(array) =COUNTBLANK(A1:A10)
that are blank.
Returns the number of cells in the array
=COUNTIF(array, criteria) =COUNTIF(A1:A10, “>10”)
that meet a specific selection criteria.
=IF(logical statement, =IF(A1=“Yes”, 1, 0). If A1 =
result if the statement is Returns a result based on the outcome “Yes”, returns a 1. If not, returns
true, result if the statement of the logical statement. a 0.
is false)
Adds and returns the sum of the =SUM(A1:A10)
=SUM(array)
numbers in the array.
=VLOOKUP(lookup
value, reference table, Searches and retrieves information
column number in the from a specified column in a reference
reference table containing table.
results)
• Illustrated VLOOKUP example next.
3
Reference Example
In Examples 1 and 2, insert a formula for the tax amount for January that can be copied and
pasted to other months.
Example 1:
Tax Rate
15%
Month Earnings Tax
January $70,000
February $68,000
March $72,000
April $74,000
May $73,000
June $84,000
July $96,000
August $90,000
September $76,000
October $64,000
November $84,000
December $106,000
Total
Example 2:
Tax Rate
15% 20% 25%
Month Earnings Tax 1 Tax 2 Tax 3
January $70,000
February $68,000
March $72,000
April $74,000
May $73,000
June $84,000
July $96,000
August $90,000
September $76,000
October $64,000
November $84,000
December $106,000
Total
4
VLOOKUP Example
Consider the following averages in GSB 512.
Student Average
1 75.75
2 77.77
3 90.99
4 82.38
5 90.94
6 83.26
7 67.83
8 64.95
9 72.06
10 84.97
1. Convert averages to letter grades given the following scale:
92% and above A 72% - 76% C+
88% - 92% A- 68% - 72% C
84% - 88% B+ 64% - 68% C-
80% - 84% B 60% - 64% D
76% - 80% B- < 60% F
2. Find the average for Students 1, 6, and 12
Practice Problem
The Match Data file has two worksheets. The first worksheet contains CEO compensation for
100 firms whereas second worksheet contains total asset of only 86 firms. In Column C of the
Assets worksheet, include the compensation of the corresponding 86 CEOs, matched by the firm
name. Use the VLOOKUP function for matching.
5
Data Wrangling
(Chapter 2; Sections 2.1 and 2.4 skipped)
Data wrangling is the process of retrieving, cleansing, integrating, transforming and enriching
data to support analytics. It is often considered one of the most critical and time-consuming
steps in a data analysis project. Most data analysts spend a considerable amount of time
inspecting, cleaning, and preparing the data for subsequent analysis. We will focus on three key
tasks during the data wrangling process:
• data inspection
• handle missing values and subsetting
• data transformation.
Beside visually reviewing data, counting and sorting are among the very first tasks performed to
gain a better understanding and insights into the data. Counting and sorting data help us verify
that the data set is complete or has missing values, especially in important variables. Sorting data
also allows us to review the range of values in each variable. We can sort data based on a single
variable or multiple variables.
Data Inspection Example
BalanceGig is a company that matches independent workers for short term engagements with
businesses in the construction, automotive, and high-tech industries. The ‘gig’ employees work
only for a short period of time, often on a particular project or a specific task. A manager at
BalanceGig extracts the employee data from their most recent work engagement, including the
hourly wage, the client’s industry, and the employee’s job classification. A portion of the Gig
data set is shown in the following table.
Table: Gig employee data
EmployeeID HourlyWage Industry Job
1 32.81 Construction Analyst
2 46 Automotive Engineer
. . . .
. . . .
. . . .
604 26.09 Construction Other
• Find the number of missing observations for the HourlyWage, Industry, and Job variables.
• Find the number of employees who
• Worked in the automotive industry
• Earned more than $30 per hour
• Both
• Find the lowest- and highest-paid accountants who worked in the automotive and the tech
industries.
6
Summary
▪ There is a total of 604 records in the data set. There are no missing values in the hourly
wage. The Industry and Job variables have 10 and 16 missing values, respectively.
▪ 190 employees worked in the automotive industry, 536 employees earned more than $30 per
hour, and 181 employees worked in the automotive industry and earned more than $30 per
hour.
▪ The lowest and the highest hourly wages in the data set are $24.28 and $51.00, respectively.
The three employees who had the lowest hourly wage of $24.28 all worked in the
Construction industry and were hired as Engineer, Sales Rep, and Accountant, respectively.
Interestingly, the employee with the highest hourly wage of $51.00 also worked in the
Construction industry in a job type classified as Other.
▪ The lowest and the highest paid accountants who worked in the automotive industry made
$28.74 and $49.32 per hour, respectively. In the Tech industry, the lowest and the highest
paid accountant made $36.13 and $49.49 per hour, respectively. Note that the lowest hourly
wage for an accountant is considerably higher in the Tech industry compared to the
automotive industry ($36.13 > $28.74).
________________________________________________________________________
Dealing with Missing Values
There are two common strategies for dealing with missing values. The omission strategy
recommends that observations with missing values be excluded from subsequent analysis. The
imputation strategy recommends that the missing values be replaced with some reasonable
imputed values. For numerical variables, it is common to replace the missing values with the
average (typical) values across relevant observations. For categorical variables, it is common to
impute the most predominant category. For small data sets, you can simply use Excel functions
like sorting or filtering to identify missing information and using omission or imputation
strategies. Other packages, like R, allow built in algorithms for performing such tasks.
Using Gig data, show that with the omission strategy, you are left with 578 records. Here 26
(10+16) records are removed; fewer than 26 would have been removed if both Industry and Job
variables had missing information for the same employee.
7
Subsetting
• The process of extracting portions of a data set that are relevant to the analysis is called
subsetting.
• It is commonly used to pre-process the data prior to analysis.
• Sometimes, subsetting involves excluding variables instead of observations.
• Subsetting can also be performed as part of descriptive analytics that helps reveal insights in
the data. For example, comparing subsets of medical records with different treatment results,
we may identify potential contributing factors of success in a treatment.
Example
Catherine Hill is a marketing manager at Organic Food Superstore. She has been assigned to
market the company’s new line of Asian inspired meals. Research has shown that the most likely
customers for healthy ethnic cuisines are college-educated millennials (born between 1982 and
2000). In order to spend the company’s marketing dollars efficiently, Catherine wants to focus
on this target demographic when designing the marketing campaign. With the help of the
Information Technology (IT) group, Catherine has acquired a representative sample of Organic
Food Superstore’s customers. A portion of the information is shown below.
Table A Representative Sample of Customers
CustID Gender Race BirthDate … Channel
1530016 Female Black 12/16/1986 … SM
1531136 Male White 5/9/1993 … TV
. . . . . .
. . . . . .
. . . . . .
1579979 Male White 7/5/1999 … Referral
Subset the Customers data to first identify college-educated millennial customers and then
compare the profiles of female and male college-educated millennial customers.
8
Summary
The Female worksheet contains 21 observations of female college-educated millennials and the
Male worksheet contains 38 observations of male college-educated millennials. The following
table displays a portion of the data of the Female and Male subsets.
a) Female College-educated Millennials
Sex HouseholdSize Income Spending2018 NumOfOrders Channel
Female 5 53000 241 3 SM
Female 3 84000 153 2 Web
... ... ... ... ... ...
Female 1 52000 586 13 Referral
b) Male College-educated Millennials
Sex HouseholdSize Income Spending2018 NumOfOrders Channel
Male 5 94000 843 12 TV
Male 1 97000 1028 17 Web
... ... ... ... ... ...
Male 5 102000 926 10 SM
• Female and male customers are similar in terms of household sizes and total spending in
2018.
• More high-income earners are found among the male customers than female customers.
• Male customers seem to order more than female customers do even though their total
spending is about the same.
• A large portion of the male customers were acquired through social media ads whereas
female customers were often acquired through Web ads and referral programs.
9
Transforming Categorical Variables
• We use labels or names to identify the distinguishing characteristics of a categorical variable.
Example include sex (male or female), marital status (single, married, widowed, divorced,
separated) and the performance of a manager (excellent, good, fair, poor).
• Most quantitative techniques are limited in their abilities to handle categorical data directly.
• A common transformation of categorical variables is to convert them into numerical values.
• A dummy variable, also referred to as an indicator or a binary variable, is commonly used to
describe two categories of a variable. It assumes a value of 1 for one of the categories and 0
for the other category, referred to as the reference or the benchmark category.
• For example, we can define a dummy variable to categorize a person’s sex using 1 for male
and 0 for female where females is the reference category. Alternatively, without any loss of
generality, we can define 1 for female and 0 for male, using males as the reference category.
All interpretation of the results is made in relation to the reference category.
• Oftentimes, a categorical variable is defined by more than two categories. For example, the
mode of transportation used to commute may be described by three categories: Public
Transportation, Driving Alone, and Car Pooling.
• Given k categories of a variable, the general rule is to create 𝑘 − 1 dummy variables, using
the last category as reference. In the above example, we can define two dummy variables 𝑑1
and 𝑑2 , where 𝑑1 equals 1 for Public Transportation, 0 otherwise, and 𝑑2 equals 1 for
Driving Alone, 0 otherwise. Here, Car Pooling, the reference category, is indicated when
𝑑1 = 𝑑2 = 0. Creating the third dummy variable for Car Pooling would create information
redundancy; certain analytical models cannot even be estimated with k dummy variables.
10
Dummy Variable Example
For the new Asian-inspired meal kits, Catherine feels that understanding the channels through
which customers were acquired is important to predict customers’ future behaviors. In order to
include Channel in her predictive model, Catherine needs to convert the channel categories into
dummy variables. Use the Channel variable in the Customers data to create the relevant dummy
variables. Find the number and percentage of customers in each channel.
SOLUTION:
a. Open the Customers data file.
b. Enter “Referral” in a blank column and enter the formula =IF(N2="Referral",1,0). Copy and
paste this formula to the rest of the column. Repeat the analysis for other channels.
Channel Referral SM TV Web
Number 38 39 57 66
Percentage 19.00% 19.50% 28.50% 33.00%
Scores Example
For the new Asian-inspired meal kits, Catherine wants to pay attention to customer satisfaction.
As the customer satisfaction ratings represent ordinal data, she wants to convert them to category
scores ranging from 1 (Very Dissatisfied) to 5 (Very Satisfied) to make the variable more readily
usable in predictive models. Use the satisfaction variable in the Customers data to create
category scores. Find the average score.
SOLUTION:
a. Open the Customers data file.
b. Enter “Satisfaction_Score” in a blank column. Enter the formula =IF(M2="Very Satisfied",
5, IF(M2="Somewhat Satisfied", 4, IF(M2="Neutral", 3, IF(M2="Somewhat Dissatisfied", 2,
1)))). Copy and paste this formula to the rest of the column. We find the average score of
3.745.
• Pay close attention to the parentheses in the above nested IF statement.
Practice Problems
12, 26, and 56 (answers in Appendix D)