0% found this document useful (0 votes)
4 views164 pages

Course 4 Process Data From Dirty To Clean

Course 4 consists of six modules focusing on essential data analysis skills, including data integrity, cleaning data, and using SQL for data management. Participants will learn the importance of clean data, how to verify and report cleaning results, and tips for building a resume in data analytics. The course is designed for self-paced learning with various materials and assessments to reinforce skills.

Uploaded by

selinezlyn
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views164 pages

Course 4 Process Data From Dirty To Clean

Course 4 consists of six modules focusing on essential data analysis skills, including data integrity, cleaning data, and using SQL for data management. Participants will learn the importance of clean data, how to verify and report cleaning results, and tips for building a resume in data analytics. The course is designed for self-paced learning with various materials and assessments to reinforce skills.

Uploaded by

selinezlyn
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

Course 4 content

Each course is broken into modules. Here’s a quick overview of the skills you’ll gain in
each of the six Course 4 modules.

Module 1: The importance of integrity

Data integrity is critical to successful analysis. In this part of the course, you’ll explore
methods and steps that analysts take to check their data for integrity. This includes
knowing what to do when you don’t have enough data. You’ll also learn about random
samples and understand how to avoid sampling bias. All of these methods will also help
you ensure your analysis is successful.

Module 2: Clean data for more accurate insights

Every data analyst wants to analyze clean data. In this part of the course, you’ll learn
the difference between clean and dirty data. Then, you’ll practice cleaning data in
spreadsheets and other tools.

Module 3: Data cleaning with SQL

Knowing a variety of ways to clean data can make a data analyst’s job much easier. In
this part of the course, you’ll use SQL to clean data from databases. In particular, you’ll
explore how SQL queries and functions can be used to clean and transform your data
before an analysis.

Module 4: Verify and report cleaning results

When you clean data, you make changes to the original dataset. It’s important to verify
the changes you make are accurate and to let your teammates know about the
changes. In this part of the course, you’ll learn to verify that data is clean and report
your data cleaning results. With verified clean data, you’re ready to begin analyzing!

1
Module 5: Add data to your resume

Creating an effective resume will help you in your data analytics career. In this part of
the course, you’ll learn all about the job application process. Your focus will be on
building a resume that highlights your strengths and relevant experience.

Module 6: Course wrap-up

Review the course glossary and prepare for the next course in the Google Data
Analytics Certificate program.

What to expect
Each module includes a series of lessons with many types of learning opportunities.
These include:

 Videos for instructors to teach new concepts and demonstrate the use of tools
 Step-by-step guides you can use to follow along with instructors as they
demonstrate tools
 Readings to explore topics more in-depth and build on the concepts from the videos
 Practice quizzes to prepare you for graded quizzes
 Graded quizzes to measure your progress and give you valuable feedback

This program was designed to let you work at your own pace—your personalized
deadlines are just a guide. There is no penalty for late assignments. To earn your
certificate, you simply need to complete all of the work.

In this course, you'll be assessed with graded quizzes and activities. Both are based on
the wide variety of learning materials and activities that reinforce the important skills
you’ll develop. And both can be taken more than once.

Tips for success


 It is strongly recommended that you go through the items in each lesson in the order
they appear because new information and concepts build on previous knowledge.

2
 Participate in all learning opportunities to gain as much knowledge and experience as
possible.
 If something is confusing, don’t hesitate to replay a video, review a reading, or repeat a
self-review activity.
 When you encounter useful links in this course, bookmark them so you can refer to the
information later for study or review.

Updates to the course

As you complete this course, you may notice updates to the content, like new practice
materials and additional examples. These updates ensure the program provides up-to-
date skills and guidance that will help you in your data analytics career. If you previously
completed a graded activity, you may need to repeat the assessment in order to
complete this course.

Module # 01

Introduction to data integrity

3
Hi! Good to see you! My name is Sally, and I'm here to teach you all about processing
data. I'm a measurement and analytical lead at Google. My job is to help advertising
agencies and companies measure success and analyze their data, so I get to meet with
lots of different people to show them how data analysis helps with their
advertising. Speaking of analysis, you did great earlier learning how to gather
and organize data for analysis. It's definitely an important step in the data analysis
process, so well done!
Now let's talk about how to make sure that your organized data is complete
and accurate. Clean data is the key to making sure your data has integrity before you
analyze it. We'll show you how to make sure your data is clean and tidy. Cleaning and
processing data is one part of the overall data analysis process. As a quick reminder,
that process is Ask, Prepare, Process, Analyze, Share, and Act. Which means it's time
for us to explore the Process phase, and I'm here to guide you the whole way. I'm very
familiar with where you are right now.
I'd never heard of data analytics until I went through a program similar to this one. Once
I started making progress, I realized how much I enjoyed data analytics and the doors it
could open. And now I'm excited to help you open those same doors! One thing I
realized as I worked for different companies, is that clean data is important in every
industry. For example, I learned early in my career to be on the lookout for duplicate
data, a common problem that analysts come across when cleaning. I used to work for a
company that had different types of subscriptions. In our data set, each user would have
a new row for each subscription type they bought, which meant users would show up
more than once in my data.
So if I had counted the number of users in a table without accounting for duplicates like
this, I would have counted some users twice instead of once. As a result, my analysis
would have been wrong, which would have led to problems in my reports and for the
stakeholders relying on my analysis. Imagine if I told the CEO that we had twice as
many customers as we actually did!? That's why clean data is so important. So the first
step in processing data is learning about data integrity. You will find out what data
integrity is and why it is important to maintain it throughout the data analysis
process. Sometimes you might not even have the data that you need, so you'll have to
create it yourself.
This will help you learn how sample size and random sampling can save you time
and effort. Testing data is another important step to take when processing data. We'll
share some guidance on how to test data before your analysis officially begins. Just like
you'd clean your clothes and your dishes in everyday life, analysts clean their data all
the time, too. The importance of clean data will definitely be a focus here. You'll learn
data cleaning techniques for all scenarios, along with some pitfalls to watch out for as
you clean. You'll explore data cleaning in both spreadsheets and databases, building on
what you've already learned about spreadsheets.
We'll talk more about SQL and how you can use it to clean data and do other useful
things, too. When analysts clean their data, they do a lot more than a spot check to
make sure it was done correctly. You'll learn ways to verify and report your cleaning
results. This includes documenting your cleaning process, which has lots of benefits
that we'll explore. It's important to remember that processing data is just one of the
tasks you'll complete as a data analyst. Actually, your skills with cleaning data might just

4
end up being something you highlight on your resume when you start job
hunting. Speaking of resumes, you'll be able to start thinking about how to build your
own from the perspective of a data analyst.
Once you're done here, you'll have a strong appreciation for clean data and how
important it is in the data analysis process. So let's get started!

Why data integrity is important


Welcome back. In this video, we're going to discuss data integrity and some risks you
might run into as a data analyst. A strong analysis depends on the integrity of the
data. If the data you're using is compromised in any way, your analysis won't be as
strong as it should be. Data integrity is the accuracy, completeness, consistency, and
trustworthiness of data throughout its lifecycle. That might sound like a lot of qualities for
the data to live up to. But trust me, it's worth it to check for them all before proceeding
with your analysis.
Otherwise, your analysis could be wrong. Not because you did something wrong, but
because the data you were working with was wrong to begin with. When data integrity is
low, it can cause anything from the loss of a single pixel in an image to an incorrect
medical decision. In some cases, one missing piece can make all of your data
useless. Data integrity can be compromised in lots of different ways. There's a chance
data can be compromised every time it's replicated, transferred, or manipulated in any
way. Data replication is the process of storing data in multiple locations.
If you're replicating data at different times in different places, there's a chance your data
will be out of sync. This data lacks integrity because different people might not be using
the same data for their findings, which can cause inconsistencies. There's also the issue
of data transfer, which is the process of copying data from a storage device to memory,
or from one computer to another. If your data transfer is interrupted, you might end up
with an incomplete data set, which might not be useful for your needs. The data
manipulation process involves changing the data to make it more organized and easier
to read. Data manipulation is meant to make the data analysis process more
efficient, but an error during the process can compromise the efficiency. Finally, data
can also be compromised through human error, viruses, malware, hacking, and system
failures, which can all lead to even more headaches.
I'll stop there. That's enough potentially bad news to digest. Let's move on to some
potentially good news. In a lot of companies, the data warehouse or data engineering
team takes care of ensuring data integrity. Coming up, we'll learn about checking data

5
integrity as a data analyst. But rest assured, someone else will usually have your back
too. After you've found out what data you're working with, it's important to double-check
that your data is complete and valid before analysis.
This will help ensure that your analysis and eventual conclusions are
accurate. Checking data integrity is a vital step in processing your data to get it ready
for analysis, whether you or someone else at your company is doing it. Coming up,
you'll learn even more about data integrity. See you soon!

More about data integrity and


compliance
This reading illustrates the importance of data integrity using an example of a global
company’s data. Definitions of terms that are relevant to data integrity will be provided
at the end.

Scenario: calendar dates for a global company

Calendar dates are represented in a lot of different short forms. Depending on where
you live, a different format might be used.

 In some countries, 12/10/20 (DD/MM/YY) stands for October 12, 2020.


 In other countries, the national standard is YYYY-MM-DD so October 12, 2020
becomes 2020-10-12.
 In the United States, (MM/DD/YY) is the accepted format so October 12, 2020 is going
to be 10/12/20.

Now, think about what would happen if you were working as a data analyst for a global
company and didn’t check date formats. Well, your data integrity would probably be
questionable. Any analysis of the data would be inaccurate. Imagine ordering extra
inventory for December when it was actually needed in October!

A good analysis depends on the integrity of the data, and data integrity usually depends
on using a common format. So it is important to double-check how dates are formatted
to make sure what you think is December 10, 2020 isn’t really October 12, 2020, and
vice versa.

6
Here are some other things to watch out for:

 Data replication compromising data integrity: Continuing with the


example, imagine you ask your international counterparts to verify dates and stick to
one format. One analyst copies a large dataset to check the dates. But because of
memory issues, only part of the dataset is actually copied. The analyst would be
verifying and standardizing incomplete data. That partial dataset would be certified as
compliant but the full dataset would still contain dates that weren't verified. Two versions
of a dataset can introduce inconsistent results. A final audit of results would be essential
to reveal what happened and correct all dates.
 Data transfer compromising data integrity: Another analyst checks the
dates in a spreadsheet and chooses to import the validated and standardized data back
to the database. But suppose the date field from the spreadsheet was incorrectly
classified as a text field during the data import (transfer) process. Now some of the
dates in the database are stored as text strings. At this point, the data needs to be
cleaned to restore its integrity.
 Data manipulation compromising data integrity: When checking dates,
another analyst notices what appears to be a duplicate record in the database and
removes it. But it turns out that the analyst removed a unique record for a company’s
subsidiary and not a duplicate record for the company. Your dataset is now missing
data and the data must be restored for completeness.

Conclusion

Fortunately, with a standard date format and compliance by all people and systems that
work with the data, data integrity can be maintained. But no matter where your data
comes from, always be sure to check that it is valid, complete, and clean before you
begin any analysis.

7
Reference: Data constraints and examples

As you progress in your data journey, you'll come across many types of data constraints
(or criteria that determine validity). The table below offers definitions and examples of
data constraint terms you might come across.

Data
Definition Examples
constraint

Data type Values must be of a If the data type is a date, a single number like 30 would
certain type: date, fail the constraint and be invalid
number, percentage,
Boolean, etc.

Data range Values must fall If the data range is 10-20, a value of 30 would fail the
between predefined constraint and be invalid
maximum and
minimum values

Mandatory Values can’t be left If age is mandatory, that value must be filled in
blank or empty

Unique Values can’t have a Two people can’t have the same mobile phone number

8
Data
Definition Examples
constraint

duplicate within the same service area

Regular Values must match a A phone number must match ###-###-#### (no other
expression prescribed pattern characters allowed)
(regex)
patterns

Cross-field Certain conditions for Values are percentages and values from multiple fields
validation multiple fields must be must add up to 100%
satisfied

Primary-key (Databases only) value A database table can’t have two rows with the same
must be unique per primary key value. A primary key is an identifier in a
column database that references a column in which each value is
unique. More information about primary and foreign keys
is provided later in the program.

Set- (Databases only) values Value for a column must be set to Yes, No, or Not
membership for a column must come Applicable
from a set of discrete
values

Foreign-key (Databases only) values In a U.S. taxpayer database, the State column must be a
for a column must be valid state or territory with the set of acceptable values
unique values coming defined in a separate States table
from a column in
another table

Accuracy The degree to which the If values for zip codes are validated by street location, the
data conforms to the accuracy of the data goes up.
actual entity being
measured or described

Completeness The degree to which the If data for personal profiles required hair and eye color,

9
Data
Definition Examples
constraint

data contains all desired and both are collected, the data is complete.
components or
measures

Consistency The degree to which the If a customer has the same address in the sales and repair
data is repeatable from databases, the data is consistent.
different points of entry
or collection

Balance objectives with data integrity


Hey there, it's good to remember to check for data integrity. It's also important to check
that the data you use aligns with the business objective. This adds another layer to the
maintenance of data integrity because the data you're using might have limitations that
you'll need to deal with. The process of matching data to business objectives can
actually be pretty straightforward. Here's a quick example. Let's say you're an analyst
for a business that produces and sells auto parts. If you need to address a question
about the revenue generated by the sale of a certain part, then you'd pull up the
revenue table from the data set.
If the question is about customer reviews, then you'd pull up the reviews table to
analyze the average ratings. But before digging into any analysis, you need to consider
a few limitations that might affect it. If the data hasn't been cleaned properly, then you
won't be able to use it yet. You would need to wait until a thorough cleaning has been
done. Now, let's say you're trying to find how much an average customer spends. You
notice the same customer's data showing up in more than one row. This is called
duplicate data.
To fix this, you might need to change the format of the data, or you might need to
change the way you calculate the average. Otherwise, it will seem like the data is for
two different people, and you'll be stuck with misleading calculations. You might also
realize there's not enough data to complete an accurate analysis. Maybe you only have
a couple of months' worth of sales data. There's slim chance you could wait for more
data, but it's more likely that you'll have to change your process or find alternate
sources of data while still meeting your objective. I like to think of a data set like a
picture. Take this picture. What are we looking at?
Unless you're an expert traveler or know the area, it may be hard to pick out from just
these two images. Visually, it's very clear when we aren't seeing the whole
picture. When you get the complete picture, you realize... you're in London! With
incomplete data, it's hard to see the whole picture to get a real sense of what is going
on. We sometimes trust data because if it comes to us in rows and columns, it

10
seems like everything we need is there if we just query it. But that's just not true. I
remember a time when I found out I didn't have enough data and had to find a solution.
I was working for an online retail company and was asked to figure out how to shorten
customer purchase to delivery time. Faster delivery times usually lead to happier
customers. When I checked the data set, I found very limited tracking information. We
were missing some pretty key details. So the data engineers and I created new
processes to track additional information, like the number of stops in a journey. Using
this data, we reduced the time it took from purchase to delivery and saw an
improvement in customer satisfaction. That felt pretty great!
Learning how to deal with data issues while staying focused on your objective will help
set you up for success in your career as a data analyst. And your path to success
continues. Next step, you'll learn more about aligning data to objectives. Keep it up

Well-aligned objectives and


data
You can gain powerful insights and make accurate conclusions when data is well-
aligned to business objectives. As a data analyst, alignment is something you will need
to judge. Good alignment means that the data is relevant and can help you solve a
business problem or determine a course of action to achieve a given business objective.

In this reading, you will review the business objectives associated with three scenarios.
You will explore how clean data and well-aligned business objectives can help you
come up with accurate conclusions. On top of that, you will learn how new variables
discovered during data analysis can cause you to set up data constraints so you can
keep the data aligned to a business objective.

Clean data + alignment to business objective = accurate conclusions

Business objective

Account managers at Impress Me, an online content subscription service, want to know
how soon users view content after their subscriptions are activated.

11
To start off, the data analyst verifies that the data exported to spreadsheets is clean and
confirms that the data needed (when users access content) is available. Knowing this,
the analyst decides there is good alignment of the data to the business objective. All
that is missing is figuring out exactly how long it takes each user to view content after
their subscription has been activated.

Here are the data processing steps the analyst takes for a user from an account called
V&L Consulting. (These steps would be repeated for each subscribing account, and for
each user associated with that account.)

Step 1

Data-processing step Source of data

Look up the activation date for V&L Consulting Account spreadsheet

Relevant data in spreadsheet:

Result: October 21, 2019

12
Step 2

Data-processing step Source of data

Look up the name of a user belonging to the V&L Account spreadsheet


Consulting account (users tab)

Relevant data in spreadsheet:

Result: Maria Ballantyne

Step 3

Data-processing step Source of data

Find the first content access date for Maria B. Content usage spreadsheet

Relevant data in spreadsheet:

Result: October 31, 2019

Step 4

Data-processing step Source of data

Calculate the time between activation and first content New spreadsheet
usage for Maria B. calculation

13
Relevant data in spreadsheet:

Result: 10 days

Pro tip 1

In the above process, the analyst could use VLOOKUP to look up the data in Steps 1, 2,
and 3 to populate the values in the spreadsheet in Step 4. VLOOKUP is a spreadsheet
function that searches for a certain value in a column to return a related piece of
information. Using VLOOKUP can save a lot of time; without it, you have to look up dates
and names manually.

Refer to the VLOOKUP page in the Google Help Center for how to use the function in
Google Sheets.

Pro tip 2

In Step 4 of the above process, the analyst could use the DATEDIF function to
automatically calculate the difference between the dates in column C and column D.
The function can calculate the number of days between two dates.

Refer to the Microsoft Support DATEDIF page for how to use the function in Excel. The
DAYS360 function does the same thing in accounting spreadsheets that use a 360-day
year (twelve 30-day months).

Refer to the DATEDIF page in the Google Help Center for how to use the function in
Google Sheets.

14
Alignment to business objective + additional data cleaning = accurate
conclusions

Business objective

Cloud Gate, a software company, recently hosted a series of public webinars as free
product introductions. The data analyst and webinar program manager want to identify
companies that had five or more people attend these sessions. They want to give this
list of companies to sales managers who can follow up for potential sales.

The webinar attendance data includes the fields and data shown below.

Name <First name> <Last This was required information


name> attendees had to submit

Email xxxxx@[Link] This was required information attendees


Addres had to submit
s

Compa <Company name> This was optional information attendees


ny could provide

Data cleaning

The webinar attendance data seems to align with the business objective. But the data
analyst and program manager decide that some data cleaning is needed before the
analysis. They think data cleaning is required because:

15
 The company name wasn’t a mandatory field. If the company name is blank, it might be
found from the email address. For example, if the email address is
username@[Link], the company field could be filled in with Google for the data
analysis. This data cleaning step assumes that people with company-assigned email
addresses attended a webinar for business purposes.
 Attendees could enter any name. Since attendance across a series of webinars is being
looked at, they need to validate names against unique email addresses. For example, if
Joe Cox attended two webinars but signed in as Joe Cox for one and Joseph Cox for
the other, he would be counted as two different people. To prevent this, they need to
check his unique email address to determine that he was the same person. After the
validation, Joseph Cox could be changed to Joe Cox to match the other instance.

Alignment to business objective + newly discovered variables +


constraints = accurate conclusions

Business objective

An after-school tutoring company, A+ Education, wants to know if there is a minimum


number of tutoring hours needed before students have at least a 10% improvement in
their assessment scores.

The data analyst thinks there is good alignment between the data available and the
business objective because:

 Students log in and out of a system for each tutoring session, and the number of hours
is tracked
 Assessment scores are regularly recorded

16
Data constraints for new variables

After looking at the data, the data analyst discovers that there are other variables to
consider. Some students had consistent weekly sessions while other students had
scheduled sessions more randomly even though their total number of tutoring hours
was the same. The data doesn’t align as well with the original business objective as first
thought, so the analyst adds a data constraint to focus only on the students with
consistent weekly sessions. This modification helps to get a more accurate picture
about the enrollment time needed to achieve a 10% improvement in assessment
scores.

Key takeaways

Hopefully these examples give you a sense of what to look for to know if your data
aligns with your business objective.

 When there is clean data and good alignment, you can get accurate insights and make
conclusions the data supports.
 If there is good alignment but the data needs to be cleaned, clean the data before you
perform your analysis.
 If the data only partially aligns with an objective, think about how you could modify the
objective, or use data constraints to make sure that the subset of data better aligns with
the business objective.

Deal with insufficient data


Every analyst has been in a situation where there is insufficient data to help with their business
objective. Considering how much data is generated every day, it may be hard to believe, but it's
true. So let's discuss what you can do when you have insufficient data. We'll cover how to set
limits for the scope of your analysis and what data you should include. At one point, I was a data
analyst at a support center. Every day, we received customer questions, which were logged in
as support tickets. I was asked to forecast the number of support tickets coming in per month
to figure out how many additional people we needed to hire.

It was very important that we had sufficient data spanning back at least a couple of years
because I had to account for year-to-year and seasonal changes. If I just had the current year's

17
data available, I wouldn't have known that a spike in January is common and has to do with
people asking for refunds after the holidays. Because I had sufficient data, I was able to suggest
we hire more people in January to prepare. Challenges are bound to come up, but the good
news is that once you know your business objective, you'll be able to recognize whether you
have enough data. And if you don't, you'll be able to deal with it before you start your
analysis. Now, let's check out some of those limitations you might come across and how you
can handle different types of insufficient data. Say you're working in the tourism industry,
and you need to find out which travel plans are searched most often.

If you only use data from one booking site, you're limiting yourself to data from just one
source. Other booking sites might show different trends that you would want to consider for your
analysis. If a limitation like this impacts your analysis, you can stop and go back to your
stakeholders to figure out a plan. If your data set keeps updating, that means the data is still
incoming and might not be complete. So if there's a brand new tourist attraction that you're
analyzing interest and attendance for, there's probably not enough data for you to determine
trends. For example, you might want to wait a month to gather data. Or you can check in with
the stakeholders and ask about adjusting the objective.

For example, you might analyze trends from week to week instead of month to month. You
could also base your analysis on trends over the past three months and say, "Here's what
attendance at the attraction for month four could look like." You might not have enough data to
know if this number is too low or too high. But you would tell stakeholders that it's your best
estimate based on the data that you currently have. On the other hand, your data could be older
and no longer be relevant. Outdated data about customer satisfaction won't include the most
recent responses. So you'll be relying on the ratings for hotels or vacation rentals that might no
longer be accurate.

In this case, your best bet might be to find a new data set to work with. Data that's
geographically-limited could also be unreliable. If your company is global, you wouldn't want to
use data limited to travel in just one country. You would want a data set that includes all
countries. So that's just a few of the most common limitations you'll come across and some
ways you can address them. You can identify trends with the available data or wait for more
data if time allows; you can talk with stakeholders and adjust your objective; or you can look
for a new data set. The need to take these steps will depend on your role in your company
and possibly the needs of the wider industry.

18
But learning how to deal with insufficient data is always a great way to set yourself up for
success. Your data analyst powers are growing stronger. And just in time. After you learn more
about limitations and solutions, you'll learn about statistical power, another fantastic tool for you
to use. See you soon!

When you find an issue with


your data
When you are getting ready for data analysis, you might realize you don’t have the data
you need or you don’t have enough of it. In some cases, you can use what is known as
proxy data in place of the real data. Think of it like substituting oil for butter in a recipe
when you don’t have butter. In other cases, there is no reasonable substitute and your
only option is to collect more data.

Consider the following data issues and suggestions on how to work around them.

Data issue 1: no data

Possible Solutions Examples of solutions in real life

Gather the data on a small scale to If you are surveying employees about what
perform a preliminary analysis and they think about a new performance and
then request additional time to bonus plan, use a sample for a preliminary
complete the analysis after you analysis. Then, ask for another 3 weeks to
have collected more data. collect the data from all employees.

If there isn’t time to collect data, If you are analyzing peak travel times for
perform the analysis using proxy commuters but don’t have the data for a
data from other datasets. This is particular city, use the data from another
the most common workaround. city with a similar size and demographic.

19
Data issue 2: too little data

Possible Solutions Examples of solutions in real life

Do the analysis using If you are analyzing trends for owners of golden
proxy data along with retrievers, make your dataset larger by including the
actual data. data from owners of labradors.

Adjust your analysis to If you are missing data for 18- to 24-year-olds, do the
align with the data you analysis but note the following limitation in your report:
already have. this conclusion applies to adults 25 years and older
only.

Data issue 3: wrong data, including data with errors*

Possible Solutions Examples of solutions in real life

If you have the wrong data because If you need the data for female voters
requirements were misunderstood, and received the data for male voters,
communicate the requirements again. restate your needs.

Identify errors in the data and, if If your data is in a spreadsheet and


possible, correct them at the source by there is a conditional statement or
looking for a pattern in the errors. boolean causing calculations to be
wrong, change the conditional
statement instead of just fixing the
calculated values.

If you can’t correct data errors yourself, If your dataset was translated from a
you can ignore the wrong data and go different language and some of the
ahead with the analysis if your sample translations don’t make sense, ignore
size is still large enough and ignoring the data with bad translation and go
the data won’t cause systematic bias. ahead with the analysis of the other
data.

20
* Important note: Sometimes data with errors can be a warning sign that the data
isn’t reliable. Use your best judgment.

Use the following decision tree as a reminder of how to deal with data errors or not
enough data:

The importance of sample size


Okay, earlier we talked about having the right kind of data to meet your business objective and
the importance of having the right amount of data to make sure your analysis is as accurate as
possible. You might remember that for data analysts, a population is all possible data values in
a certain dataset. If you're able to use 100 percent of a population in your analysis, that's
great. But sometimes collecting information about an entire population just isn't possible. It's too

21
time-consuming or expensive. For example, let's say a global organization wants to know more
about pet owners who have cats. You're tasked with finding out which kinds of toys cat owners
in Canada prefer.

But there's millions of cat owners in Canada, so getting data from all of them would be a huge
challenge. Fear not! Allow me to introduce you to... sample size! When you use sample size or
a sample, you use a part of a population that's representative of the population. The goal is to
get enough information from a small group within a population to make predictions or
conclusions about the whole population. The sample size helps ensure the degree to which you
can be confident that your conclusions accurately represent the population. For the data on cat
owners, a sample size might contain data about hundreds or thousands of people rather than
millions.

Using a sample for analysis is more cost-effective and takes less time. If done carefully and
thoughtfully, you can get the same results using a sample size instead of trying to hunt down
every single cat owner to find out their favorite cat toys. There is a potential downside,
though. When you only use a small sample of a population, it can lead to uncertainty. You can't
really be 100 percent sure that your statistics are a complete and accurate representation of the
population. This leads to sampling bias, which we covered earlier in the program. Sampling bias
is when a sample isn't representative of the population as a whole.

This means some members of the population are being overrepresented or


underrepresented. For example, if the survey used to collect data from cat owners only included
people with smartphones, then cat owners who don't have a smartphone wouldn't be
represented in the data. Using random sampling can help address some of those issues with
sampling bias. Random sampling is a way of selecting a sample from a population so that every
possible type of the sample has an equal chance of being chosen. Going back to our cat owners
again, using a random sample of cat owners means cat owners of every type have an equal
chance of being chosen. Cat owners who live in apartments in Ontario would have the same
chance of being represented as those who live in houses in Alberta. As a data analyst, you'll
find that creating sample sizes usually takes place before you even get to the data.

But it's still good for you to know that the data you are going to analyze is representative of the
population and works with your objective. It's also good to know what's coming up in your data
journey. In the next video, you'll have an option to become even more comfortable with sample
sizes. See you there.

22
Calculate sample size
Before you dig deeper into sample size, familiarize yourself with these terms and
definitions:

Terminol Definitions
ogy

Populatio The entire group that you are interested in for your study. For
n example, if you are surveying people in your company, the
population would be all the employees in your company.

Sample A subset of your population. Just like a food sample, it is called a


sample because it is only a taste. So if your company is too large
to survey every individual, you can survey a representative sample
of your population.

Margin of Since a sample is used to represent a population, the sample’s


error results are expected to differ from what the result would have
been if you had surveyed the entire population. This difference is
called the margin of error. The smaller the margin of error, the
closer the results of the sample are to what the result would have
been if you had surveyed the entire population.

Confiden How confident you are in the survey results. For example, a 95%
ce level confidence level means that if you were to run the same survey
100 times, you would get similar results 95 of those 100 times.
Confidence level is targeted before you start your study because it
will affect how big your margin of error is at the end of your study.

Confiden The range of possible values that the population’s result would be
ce at the confidence level of the study. This range is the sample

23
Terminol Definitions
ogy

interval result +/- the margin of error.

Statistica The determination of whether your result could be due to random


l chance or not. The greater the significance, the less due to
significan chance.
ce

Things to remember when determining the size of your sample

When figuring out a sample size, here are things to keep in mind:

 Don’t use a sample size less than 30. It has been statistically proven that 30 is the
smallest sample size where an average result of a sample starts to represent the
average result of a population.
 The confidence level most commonly used is 95%, but 90% can work in some cases.

Increase the sample size to meet specific needs of your project:

 For a higher confidence level, use a larger sample size


 To decrease the margin of error, use a larger sample size
 For greater statistical significance, use a larger sample size

Note: Sample size calculators use statistical formulas to determine a sample size.
More about these are coming up in the course! Stay tuned.

Why a minimum sample of 30?

This recommendation is based on the Central Limit Theorem (CLT) in the field of
probability and statistics. As sample size increases, the results more closely resemble
the normal (bell-shaped) distribution from a large number of samples. A sample of 30 is
the smallest sample size for which the CLT is still valid. Researchers who rely on

24
regression analysis – statistical methods to determine the relationships between
controlled and dependent variables – also prefer a minimum sample of 30.

Still curious? Without getting too much into the math, check out these articles:

 Central Limit Theorem (CLT): This article by Investopedia explains the Central Limit
Theorem and briefly describes how it can apply to an analysis of a stock index.
 Sample Size Formula: This article by Statistics Solutions provides a little more detail
about why some researchers use 30 as a minimum sample size.

Sample sizes vary by business problem

Sample size will vary based on the type of business problem you are trying to solve.

For example, if you live in a city with a population of 200,000 and get 180,000 people to
respond to a survey, that is a large sample size. But without actually doing that, what
would an acceptable, smaller sample size look like?

Would 200 be alright if the people surveyed represented every district in the city?

Answer: It depends on the stakes.

 A sample size of 200 might be large enough if your business problem is to find out how
residents felt about the new library
 A sample size of 200 might not be large enough if your business problem is to
determine how residents would vote to fund the library

You could probably accept a larger margin of error surveying how residents feel about
the new library versus surveying residents about how they would vote to fund it. For that
reason, you would most likely use a larger sample size for the voter survey.

Larger sample sizes have a higher cost

You also have to weigh the cost against the benefits of more accurate results with a
larger sample size. Someone who is trying to understand consumer preferences for a
new line of products wouldn’t need as large a sample size as someone who is trying to

25
understand the effects of a new drug. For drug safety, the benefits outweigh the cost of
using a larger sample size. But for consumer preferences, a smaller sample size at a
lower cost could provide good enough results.

Knowing the basics is helpful

Knowing the basics will help you make the right choices when it comes to sample size.
You can always raise concerns if you come across a sample size that is too small. A
sample size calculator is also a great tool for this. Sample size calculators let you enter
a desired confidence level and margin of error for a given population size. They then
calculate the sample size needed to statistically achieve those results.

Refer to the Determine the Best Sample Size video for a demonstration of a sample
size calculator, or refer to the Sample Size Calculator reading for additional information.

Key takeaways

As you continue on your data analytics journey, be sure to familiarize yourself with key
terms including population, sample, margin of error, confidence level, and confidence
interval before calculating sample size. Remember that a minimum sample size of 30 is
recommended and that sample size varies depending on the specific business problem.
Also consider the trade-off between accuracy and cost when determining sample size,
as larger sample sizes provide more accurate results but at a higher cost. Finally, use
sample size calculators to determine the appropriate sample size for your study.

Using statistical power


Hey, there. We've all probably dreamed of having a superpower at least once in our lives. I
know I have. I'd love to be able to fly. But there's another superpower you might not have heard
of: statistical power. Statistical power is the probability of getting meaningful results from a
test. I'm guessing that's not a superpower any of you have dreamed about.

Still, it's a pretty great data superpower. For data analysts, your projects might begin with the
test or study. Hypothesis testing is a way to see if a survey or experiment has meaningful

26
results. Here's an example. Let's say you work for a restaurant chain that's planning a marketing
campaign for their new milkshakes. You need to test the ad on a group of customers before
turning it into a nationwide ad campaign. In the test, you want to check whether customers like
or dislike the campaign.

You also want to rule out any factors outside of the ad that might lead them to say they don't like
it. Using all your customers would be too time consuming and expensive. So, you'll need to
figure out how many customers you'll need to show that the ad is effective. Fifty probably
wouldn't be enough. Even if you randomly chose 50 customers, you might end up with
customers who don't like milkshakes at all. And if that happens, you won't be able to measure
the effectiveness of your ad in getting more milkshake orders since no one in the sample size
would order them. That's why you need a larger sample size: so you can make sure you get a
good number of all types of people for your test.

Usually, the larger the sample size, the greater the chance you'll have statistically significant
results with your test. And that's statistical power. In this case, using as many customers as
possible will show the actual differences between the groups who like or dislike the ad versus
people whose decision wasn't based on the ad at all. There are ways to accurately calculate
statistical power, but we won't go into them here. You might need to calculate it on your own as
a data analyst. For now, you should know that statistical power is usually shown as a value out
of one. So if your statistical power is 0.6, that's the same thing as saying 60%.

In the milk shake ad test, if you found a statistical power of 60%, that means there's a 60%
chance of you getting a statistically significant result on the ad's effectiveness. "Statistically
significant" is a term that is used in statistics. If you want to learn more about the technical
meaning, you can search online. But in basic terms, if a test is statistically significant, it means
the results of the test are real and not an error caused by random chance. So there's a 60%
chance that the results of the milkshake ad test are reliable and real and a 40% chance that the
result of the test is wrong. Usually, you need a statistical power of at least 0.8 or 80% to
consider your results statistically significant. Let's check out one more scenario.

We'll stick with milkshakes because, well, because I like milkshakes. Imagine you work for a
restaurant chain that wants to launch a brand-new birthday cake flavored milkshake. This
milkshake will be more expensive to produce than your other milkshakes. Your company hopes
that the buzz around the new flavor will bring in more customers and money to offset this

27
cost. They want to test this out in a few restaurant locations first. So let's figure out how many
locations you'd have to use to be confident in your results. First, you'd have to think about what
might prevent you from getting statistically significant results.

Are there restaurants running any other promotions that might bring in new customers? Do
some restaurants have customers that always buy the newest item, no matter what it is? Do
some location have construction that recently started, that would prevent customers from even
going to the restaurant? To get a higher statistical power, you'd have to consider all of these
factors before you decide how many locations to include in your sample size for your study. You
want to make sure any effect is most likely due to the new milkshake flavor, not another
factor. The measurable effects would be an increase in sales or the number of customers at the
locations in your sample size. That's it for now.

Coming up, we'll explore sample sizes in more detail, so you can get a better idea of how they
impact your tests and studies. In the meantime, you've gotten to know a little bit more about
milkshakes and superpowers. And of course, statistical power. Sadly, only statistical power can
truly be useful for data analysts. Though putting on my cape and flying to grab a milkshake right
now does sound pretty good.

"Statistical power can be calculated and reported for a completed experiment to


comment on the confidence one might have in the conclusions drawn from the results of
the study. It can also be used as a tool to estimate the number of observations or
sample size required in order to detect an effect in an experiment."

When data isn't readily


available
Earlier, you learned how you can still do an analysis using proxy data if you have no
data. You might have some questions about proxy data, so this reading will give you a
few more examples of the types of datasets that can serve as alternate data sources.

28
Proxy data examples

Sometimes the data to support a business objective isn’t readily available. This is when
proxy data is useful. Take a look at the following scenarios and where proxy data comes
in for each example:

Business scenario How proxy data can be used

A new car model was just launched a few The analyst proxies the number of
days ago and the auto dealership can’t clicks to the car specifications on the
wait until the end of the month for sales dealership’s website as an estimate
data to come in. They want sales of potential sales at the dealership.
projections now.

A brand new plant-based meat product The analyst proxies the sales data
was only recently stocked in grocery for a turkey substitute made out of
stores and the supplier needs to estimate tofu that has been on the market for
the demand over the next four years. several years.

The Chamber of Commerce wants to The analyst proxies the historical


know how a tourism campaign is going to data for airline bookings to the city
impact travel to their city, but the results one to three months after a similar
from the campaign aren’t publicly campaign was run six months earlier.
available yet.

Open (public) datasets

If you are part of a large organization, you might have access to lots of sources of data.
But if you are looking for something specific or a little outside your line of business, you
can also make use of open or public datasets. (You can refer to this Medium article for a
brief explanation of the difference between open and public data.)

Here's an example. A nasal version of a vaccine was recently made available. A clinic
wants to know what to expect for contraindications, but just started collecting first-party
data from its patients. A contraindication is a condition that may cause a patient not
to take a vaccine due to the harm it would cause them if taken. To estimate the number

29
of possible contraindications, a data analyst proxies an open dataset from a trial of the
injection version of the vaccine. The analyst selects a subset of the data with patient
profiles most closely matching the makeup of the patients at the clinic.

There are plenty of ways to share and collaborate on data within a community. Kaggle
([Link]) which we previously introduced, has datasets in a variety of formats
including the most basic type, Comma Separated Values (CSV) files.

CSV, JSON, SQLite, and BigQuery datasets

 CSV: Check out this Credit card customers dataset, which has information from 10,000
customers including age, salary, marital status, credit card limit, credit card category,
etc. (CC0: Public Domain, Sakshi Goyal).
 JSON: Check out this JSON dataset for trending YouTube videos (CC0: Public Domain,
Mitchell J).
 SQLite: Check out this SQLite dataset for 24 years worth of U.S. wildfire data (CC0:
Public Domain, Rachael Tatman).
 BigQuery: Check out this Google Analytics 360 sample dataset from the Google
Merchandise Store (CC0 Public Domain, Google BigQuery).

Refer to the Kaggle documentation for datasets for more information and search for and
explore datasets on your own at [Link]/datasets.

As with all other kinds of datasets, be on the lookout for duplicate data and ‘Null’ in open
datasets. Null most often means that a data field was unassigned (left empty), but
sometimes Null can be interpreted as the value, 0. It is important to understand how
Null was used before you start analyzing a dataset with Null data.

30
Key takeaways

As you work on data analysis projects, proxy data can often be used to estimate or
predict outcomes when actual data is not available. Open or public datasets can be
used as proxy data sources, and there are many available online repositories for finding
relevant datasets. But be cautious when using proxy data and ensure that it is well-
suited for the intended purpose. Finally, check for duplicate data and null values in open
datasets before using them for analysis.

Determine the best sample size


Great to see you again. In this video, we'll go into more detail about sample sizes and data
integrity. If you've ever been to a store that hands out samples, you know it's one of life's little
pleasures. For me, anyway! those small samples are also a very smart way for businesses to
learn more about their products from customers without having to give everyone a free
sample. A lot of organizations use sample size in a similar way. They take one part of
something larger.

In this case, a sample of a population. Sometimes they'll perform complex tests on their data to
see if it meets their business objectives. We won't go into all the calculations needed to do this
effectively. Instead, we'll focus on a "big picture" look at the process and what it involves. As a
quick reminder, sample size is a part of a population that is representative of the population. For
businesses, it's a very important tool. It can be both expensive and time-consuming to analyze
an entire population of data.

Using sample size usually makes the most sense and can still lead to valid and useful
findings. There are handy calculators online that can help you find sample size. You need to
input the confidence level, population size, and margin of error. We've talked about population
size before. To build on that, we'll learn about confidence level and margin of error. Knowing
about these concepts will help you understand why you need them to calculate sample
size. The confidence level is the probability that your sample accurately reflects the greater
population.

You can think of it the same way as confidence in anything else. It's how strongly you feel that
you can rely on something or someone. Having a 99 percent confidence level is ideal. But most
industries hope for at least a 90 or 95 percent confidence level. Industries like

31
pharmaceuticals usually want a confidence level that's as high as possible when they are using
a sample size. This makes sense because they're testing medicines and need to be sure they
work and are safe for everyone to use. For other studies, organizations might just need to know
that the test or survey results have them heading in the right direction.

For example, if a paint company is testing out new colors, a lower confidence level is okay. You
also want to consider the margin of error for your study. You'll learn more about this soon, but it
basically tells you how close your sample size results are to what your results would be if you
use the entire population that your sample size represents. Think of it like this. Let's say that the
principal of a middle school approaches you with a study about students' candy
preferences. They need to know an appropriate sample size, and they need it now. The school
has a student population of 500, and they're asking for a confidence level of 95 percent and a
margin of error of 5 percent.

To use the sample size calculator spreadsheet, click the link below and select “Use
Template.”

Link to template: Sample Size Calculator

We've set up a calculator in a spreadsheet, but you can also easily find this type of calculator
by searching "sample size calculator" on the internet. Just like those calculators, our
spreadsheet calculator doesn't show any of the more complex calculations for figuring out
sample size. All we need to do is input the numbers for our population, confidence level, and
margin of error. And when we type 500 for our population size, 95 for our confidence level
percentage, 5 for our margin of error percentage, the result is about 218. That means for this
study, an appropriate sample size would be 218. If we surveyed 218 students and found that 55
percent of them preferred chocolate, then we could be pretty confident that would be true of all
500 students. 218 is the minimum number of people we need to survey based on our criteria of
a 95 percent confidence level and a 5 percent margin of error. In case you're wondering, the
confidence level and margin of error don't have to add up to 100 percent.

They're independent of each other. So let's say we change our margin of error from 5 percent to
3 percent. Then we find that our sample size would need to be larger, about 341 instead of
218, to make the results of the study more representative of the population. Feel free to practice
with an online calculator. Knowing sample size and how to find it will help you when you work

32
with data. We've got more useful knowledge coming your way, including learning about margin
of error. See you soon!

Sample size calculator


In this reading, you will learn the basics of sample size calculators, how to use them,
and how to understand the results. A sample size calculator tells you how many
people you need to interview (or things you need to test) to get results that represent
the target population. Let’s review some terms you will come across when using a
sample size calculator:

 Confidence level: The probability that your sample size accurately reflects the
greater population.
 Margin of error: The maximum amount that the sample results are expected to
differ from those of the actual population.
 Population: This is the total number you hope to pull your sample from.
 Sample: A part of a population that is representative of the population.
 Estimated response rate: If you are running a survey of individuals, this is the
percentage of people you expect will complete your survey out of those who received
the survey.

How to use a sample size calculator

In order to use a sample size calculator, you need to have the population size,
confidence level, and the acceptable margin of error already decided so you can input
them into the tool. If this information is ready to go, check out these sample size
calculators below:

 Sample size calculator by [Link]


 Sample size calculator by [Link]

What to do with the results

After you have plugged your information into one of these calculators, it will give you a
recommended sample size. Keep in mind, the calculated sample size is the minimum

33
number to achieve what you input for confidence level and margin of error. If you are
working with a survey, you will also need to think about the estimated response rate to
figure out how many surveys you will need to send out. For example, if you need a
sample size of 100 individuals and your estimated response rate is 10%, you will need
to send your survey to 1,000 individuals to get the 100 responses you need for your
analysis.

Now that you have the basics, try some calculations using the sample size calculators
and refer back to this reading if you need a refresher on the definitions.

Evaluate data reliability


Hey there! Earlier, we touched on margin of error without explaining it completely. Well, we're
going to right that wrong in this video by explaining margin of error more. We'll even include an
example of how to calculate it. As a data analyst, it's important for you to figure out sample size
and variables like confidence level and margin of error before running any kind of test or
survey. It's the best way to make sure your results are objective, and it gives you a better
chance of getting statistically significant results. But if you already know the sample size, like
when you're given survey results to analyze, you can calculate the margin of error yourself.

Then you'll have a better idea of how much of a difference there is between your sample
and your population. We'll start at the beginning with a more complete definition. Margin of error
is the maximum that the sample results are expected to differ from those of the actual
population. Let's think about an example of margin of error. It would be great to survey or test
an entire population, but it's usually impossible or impractical to do this. So instead, we take a
sample of the larger population. Based on the sample size, the resulting margin of error will tell
us how different the results might be compared to the results if we had surveyed the entire
population.

Margin of error helps you understand how reliable the data from your hypothesis testing is. The
closer to zero the margin of error, the closer your results from your sample would match results
from the overall population. For example, let's say you completed a nationwide survey using a
sample of the population. You asked people who work five-day workweeks whether they like the
idea of a four-day workweek. So your survey tells you that 60% prefer a four-day
workweek. The margin of error was 10%, which tells us that between 50 and 70% like the

34
idea. So if we were to survey all five-day workers nationwide, between 50 and 70% would agree
with our results.

Keep in mind that our range is between 50 and 70%. That's because the margin of error is
counted in both directions from the survey results of 60%. If you set up a 95% confidence level
for your survey, there'll be a 95% chance that the entire population's responses will fall between
50 and 70% saying, yes, they want a four-day workweek. Since your margin of error overlaps
with that 50% mark, you can't say for sure that the public likes the idea of a four-day
workweek. In that case, you'd have to say your survey was inconclusive. Now, if you wanted a
lower margin of error, say 5%, with a range between 55 and 65%, you could increase the
sample size. But if you've already been given the sample size, you can calculate the margin of
error yourself.

Then you can decide yourself how much of a chance your results have of being statistically
significant based on your margin of error. In general, the more people you include in your
survey, the more likely your sample is representative of the entire population. Decreasing the
confidence level would also have the same effect, but that would also make it less likely that
your survey is accurate. So to calculate margin of error, you need three things: population size,
sample size, and confidence level. And just like with sample size, you can find lots of calculators
online by searching "margin of error calculator." But we'll show you in a spreadsheet, just like
we did when we calculated sample size. Lets say you're running a study on the effectiveness of
a new drug.

You have a sample size of 500 participants whose condition affects 1% of the world's
population. That's about 80 million people, which is the population for your study. Since it's a
drug study, you need to have a confidence level of 99%. You also need a low margin of
error. Let's calculate it. We'll put the numbers for population, confidence level, and sample size,
in the appropriate spreadsheet cells. And our result is a margin of error of close to 6%, plus or
minus.

When the drug study is complete, you'd apply the margin of error to your results to determine
how reliable your results might be. Calculators like this one in the spreadsheet are just one of
the many tools you can use to ensure data integrity. And it's also good to remember that
checking for data integrity and aligning the data with your objectives will put you in good shape
to complete your analysis. Knowing about sample size, statistical power, margin of error,

35
and other topics we've covered will help your analysis run smoothly. That's a lot of new
concepts to take in. If you'd like to review them at any time, you can find them all in the
glossary, or feel free to rewatch the video! Soon you'll explore the ins and outs of clean data.

All about margin of error


Margin of error is the maximum amount that the sample results are expected to
differ from those of the actual population. More technically, the margin of error defines a
range of values below and above the average result for the sample. The average result
for the entire population is expected to be within that range. We can better understand
margin of error by using some examples below.

Margin of error in baseball

36
Imagine you are playing baseball and that you are up at bat. The crowd is roaring, and
you are getting ready to try to hit the ball. The pitcher delivers a fastball traveling about
90-95mph, which takes about 400 milliseconds (ms) to reach the catcher’s glove. You
swing and miss the first pitch because your timing was a little off. You wonder if you
should have swung slightly earlier or slightly later to hit a home run. That time difference
can be considered the margin of error, and it tells us how close or far your timing was
from the average home run swing.

Margin of error in marketing

The margin of error is also important in marketing. Let’s use A/B testing as an example.
A/B testing (or split testing) tests two variations of the same web page to determine
which page is more successful in attracting user traffic and generating revenue. User
traffic that gets monetized is known as the conversion rate. A/B testing allows
marketers to test emails, ads, and landing pages to find the data behind what is working
and what isn’t working. Marketers use the confidence interval (determined by the
conversion rate and the margin of error) to understand the results.

For example, suppose you are conducting an A/B test to compare the effectiveness of
two different email subject lines to entice people to open the email. You find that subject
line A: “Special offer just for you” resulted in a 5% open rate compared to subject line B:
“Don’t miss this opportunity” at 3%.

Does that mean subject line A is better than subject line B? It depends on your margin
of error. If the margin of error was 2%, then subject line A’s actual open rate or
confidence interval is somewhere between 3% and 7%. Since the lower end of the
interval overlaps with subject line B’s results at 3%, you can’t conclude that there is a
statistically significant difference between subject line A and B. Examining the margin of
error is important when making conclusions based on your test results.

Want to calculate your margin of error?

All you need is population size, confidence level, and sample size. In order to better
understand this calculator, review these terms:

37
 Confidence level: A percentage indicating how likely your sample accurately reflects
the greater population
 Population: The total number you pull your sample from
 Sample: A part of a population that is representative of the population
 Margin of error: The maximum amount that the sample results are expected to
differ from those of the actual population

In most cases, a 90% or 95% confidence level is used. But, depending on your industry,
you might want to set a stricter confidence level. A 99% confidence level is reasonable
in some industries, such as the pharmaceutical industry.

After you have settled on your population size, sample size, and confidence level, plug
the information into a margin of error calculator like the ones below:

 Margin of error calculator by Good Calculators (free online calculators)


 Margin of error calculator by CheckMarket

Key takeaways

Margin of error is used to determine how close your sample’s result is to what the result
would likely have been if you could have surveyed or tested the entire population.
Margin of error helps you understand and interpret survey or test results in real-life.
Calculating the margin of error is particularly helpful when you are given the data to
analyze. After using a calculator to calculate the margin of error, you will know how
much the sample results might differ from the results of the entire population.

Glossary terms from module 1


Additional terms and definitions

Accuracy: The degree to which the data conforms to the actual entity being
measured or described

Completeness: The degree to which the data contains all desired components or
measures

Confidence interval: A range of values that conveys how likely a statistical


estimate reflects the population
38
Confidence level: The probability that a sample size accurately reflects the greater
population

Consistency: The degree to which data is repeatable from different points of entry or
collection

Cross-field validation: A process that ensures certain conditions for multiple data
fields are satisfied

Data constraints: The criteria that determine whether a piece of a data is clean and
valid

Data integrity: The accuracy, completeness, consistency, and trustworthiness of


data throughout its life cycle

Data manipulation: The process of changing data to make it more organized and
easier to read

Data range: Numerical values that fall between predefined maximum and minimum
values

Data replication: The process of storing data in multiple locations

DATEDIF: A spreadsheet function that calculates the number of days, months, or


years between two dates

Estimated response rate: The average number of people who typically complete
a survey

Hypothesis testing: A process to determine if a survey or experiment has


meaningful results

Mandatory: A data value that cannot be left blank or empty

Margin of error: The maximum amount that the sample results are expected to
differ from those of the actual population

Random sampling: A way of selecting a sample from a population so that every


possible type of the sample has an equal chance of being chosen

39
Regular expression (RegEx): A rule that says the values in a table must match a
prescribed pattern

Clean it up!
Can you guess what inaccurate or bad data costs businesses every year? Thousands of
dollars? Millions? Billions? Well, according to IBM, the yearly cost of poor quality data is $3.1
trillion in the US alone. That's a lot of zeros. Now, can you guess the number 1 cause of poor
quality data?

It's not a new system implementation or a computer technical glitch. The most common factor is
actually human error. Here's a spreadsheet from a law office. It shows customers, the legal
services they bought, the service order number, how much they paid, and the payment
method. Dirty data can be the result of someone typing in a piece of data
incorrectly. Inconsistent formatting, blank fields, or the same piece of data being entered more
than once, which causes duplicates. Dirty data is data that's incomplete, incorrect, or
irrelevant to the problem you're trying to solve.

When you work with dirty data, you can't be sure that your results are correct. In fact, you can
pretty much bet they won't be. Earlier, you learned that data integrity is critical to reliable data
analytics results, and clean data helps you achieve data integrity. Clean data is data that's
complete, correct, and relevant to the problem you're trying to solve. When you work with clean
data, you'll find that your projects go much more smoothly. I remember the first time I
witnessed firsthand how important clean data really is. I had just started using SQL and I
thought it worked like magic.

I could have the computer sum up millions of numbers, saving me tons of time and effort. But I
quickly discovered that only works when the data is clean. If there was even one accidental
letter in a column that should only have numbers, the computer wouldn't know what to do. So it
would throw an error, and suddenly I was stuck, and there's no way I can add up millions of
numbers by myself. So I had to clean up that data to make it work. The good news is that
there's plenty of effective processes and tools to help you do that. Coming up, you'll gain the
skills and knowledge you need to make sure the data you work with is always clean.

Why data cleaning is critical

40
Clean data is incredibly important for effective analysis. If a piece of data is entered into a
spreadsheet or database incorrectly, or if it's repeated, or if a field is left blank, or if data formats
are inconsistent, the result is dirty data. Small mistakes can lead to big consequences in the
long run. I'll be completely honest with you, data cleaning is like brushing your teeth. It's
something you should do and do properly because otherwise it can cause serious problems. For
teeth, that might be cavities or gum disease. For data, that might be costing your company
money, or an angry boss.

But here's the good news. If you keep brushing twice a day, every day, it becomes a
habit. Soon, you don't even have to think about it. It's the same with data. Trust me, it will make
you look great when you take the time to clean up that dirty data. As a quick refresher, dirty data
is incomplete, incorrect, or irrelevant to the problem you're trying to solve. It can't be used in a
meaningful way, which makes analysis very difficult, if not impossible.

On the other hand, clean data is complete, correct, and relevant to the problem you're trying to
solve. This allows you to understand and analyze information and identify important
patterns, connect related information, and draw useful conclusions. Then you can apply what
you learn to make effective decisions. In some cases, you won't have to do a lot of work to
clean data. For example, when you use internal data that's been verified and cared for by your
company's data engineers and data warehouse team, it's more likely to be clean. Let's talk
about some people you'll work with as a data analyst. Data engineers transform data into a
useful format for analysis and give it a reliable infrastructure.

This means they develop, maintain, and test databases, data processors and related
systems. Data warehousing specialists develop processes and procedures to effectively store
and organize data. They make sure that data is available, secure, and backed up to prevent
loss. When you become a data analyst, you can learn a lot by working with the person who
maintains your databases to learn about their systems. If data passes through the hands of a
data engineer or a data warehousing specialist first, you know you're off to a good start on your
project. There's a lot of great career opportunities as a data engineer or a data warehousing
specialist. If this kind of work sounds interesting to you, maybe your career path will
involve helping organizations save lots of time, effort, and money by making sure their data is
sparkling clean.

41
But even if you go in a different direction with your data analytics career and have the
advantage of working with data engineers and warehousing specialists, you're still likely to have
to clean your own data. It's important to remember: no dataset is perfect. It's always a good idea
to examine and clean data before beginning analysis. Here's an example. Let's say you're
working on a project where you need to figure out how many people use your company's
software program. You have a spreadsheet that was created internally and verified by a data
engineer and a data warehousing specialist. Check out the column labeled "Username."

It might seem logical that you can just scroll down and count the rows to figure out how many
users you have. But that won't work because one person sometimes has more than one
username. Maybe they registered from different email addresses, or maybe they have a work
and personal account. In situations like this, you would need to clean the data by eliminating
any rows that are duplicates. Once you've done that, there won't be any more duplicate
entries. Then your spreadsheet is ready to be put to work. So far we've discussed working with
internal data.

But data cleaning becomes even more important when working with external data, especially if it
comes from multiple sources. Let's say the software company from our example surveyed its
customers to learn how satisfied they are with its software product. But when you review the
survey data, you find that you have several nulls. A null is an indication that a value does not
exist in a data set. Note that it's not the same as a zero. In the case of a survey, a null would
mean the customers skipped that question. A zero would mean they provided zero as their
response.

To do your analysis, you would first need to clean this data. Step one would be to decide what
to do with those nulls. You could either filter them out and communicate that you now have a
smaller sample size, or you can keep them in and learn from the fact that the customers did not
provide responses. There's lots of reasons why this could have happened. Maybe your survey
questions weren't written as well as they could be. Maybe they were confusing or
biased, something we learned about earlier. We've touched on the basics of cleaning internal
and external data, but there's lots more to come.

What is dirty data?


42
Earlier, we discussed that dirty data is data that is incomplete, incorrect, or irrelevant
to the problem you are trying to solve. This reading summarizes:

 Types of dirty data you may encounter


 What may have caused the data to become dirty
 How dirty data is harmful to businesses

Types of dirty data

Duplicate data

Description Possible causes Potential harm to businesses

Any data record that Manual data entry, batch Skewed metrics or analyses, inflated or
shows up more than data imports, or data inaccurate counts or predictions, or
once migration confusion during data retrieval

43
Outdated data

Description Possible causes Potential harm to


businesses

Any data that is old which should be People changing roles or Inaccurate insights,
replaced with newer and more companies, or software and decision-making, an
accurate information systems becoming obsolete analytics

Incomplete data

Description Possible causes Potential harm to businesses

Any data that is missing Improper data collection Decreased productivity, inaccurate insig
important fields or incorrect data entry or inability to complete essential service

Incorrect/inaccurate data

Description Possible causes Potential harm to businesses

Any data that is Human error inserted during Inaccurate insights or decision-mak
complete but data input, fake information, or based on bad information resulting
inaccurate mock data revenue loss

Inconsistent data

Description Possible causes Potential harm to businesses

Any data that uses different Data stored incorrectly or Contradictory data points leading t
formats to represent the errors inserted during data confusion or inability to classify or
same thing transfer segment customers

44
Business impact of dirty data

For further reading on the business impact of dirty data, enter the term “dirty data” into
your preferred browser’s search bar to bring up numerous articles on the topic. Here are
a few impacts cited for certain industries from a previous search:

 Banking: Inaccuracies cost companies between 15% and 25% of revenue (source).
 Digital commerce: Up to 25% of B2B database contacts contain inaccuracies
(source).
 Marketing and sales: 99% of companies are actively tackling data quality in some
way (source).
 Healthcare: Duplicate records can be 10% and even up to 20% of a hospital’s
electronic health records (source).

Recognize and remedy dirty data


Hey, there. In this video, we'll focus on common issues associated with dirty data. These
includes spelling and other texts errors, inconsistent labels, formats and field lane, missing data
and duplicates. This will help you recognize problems quicker and give you the information
you need to fix them when you encounter something similar during your own analysis. This is
incredibly important in data analytics. Let's go back to our law office spreadsheet. As a quick
refresher, we'll start by checking out the different types of dirty data it shows.

Sometimes, someone might key in a piece of data incorrectly. Other times, they might not keep
data formats consistent. It's also common to leave a field blank. That's also called a null, which
we learned about earlier. If someone adds the same piece of data more than once, that creates
a duplicate. Let's break that down. Then we'll learn about a few other types of dirty data and
strategies for cleaning it.

Misspellings, spelling variations, mixed up letters, inconsistent punctuation, and typos in


general, happen when someone types in a piece of data incorrectly. As a data analyst, you'll
also deal with different currencies. For example, one dataset could be in US dollars and another
in euros, and you don't want to get them mixed up. We want to find these types of errors and fix
them like this. You'll learn more about this soon. Clean data depends largely on the data
integrity rules that an organization follows, such as spelling and punctuation guidelines. For
example, a beverage company might ask everyone working in its database to enter data
about volume in fluid ounces instead of cups.

45
It's great when an organization has rules like this in place. It really helps minimize the amount of
data cleaning required, but it can't eliminate it completely. Like we discussed earlier, there's
always the possibility of human error. The next type of dirty data our spreadsheet shows is
inconsistent formatting. In this example, something that should be formatted as currency is
shown as a percentage. Until this error is fixed, like this, the law office will have no idea how
much money this customer paid for its services. We'll learn about different ways to solve this
and many other problems soon.

We discussed nulls previously, but as a reminder, nulls are empty fields. This kind of dirty data
requires a little more work than just fixing a spelling error or changing a format. In this
example, the data analysts would need to research which customer had a consultation on July
4th, 2020. Then when they find the correct information, they'd have to add it to the
spreadsheet. Another common type of dirty data is duplicated. Maybe two different people
added this appointment on August 13th, not realizing that someone else had already done it
or maybe the person entering the data hit copy and paste by accident. Whatever the reason, it's
the data analyst job to identify this error and correct it by deleting one of the duplicates.

Now, let's continue on to some other types of dirty data. The first has to do with labeling. To
understand labeling, imagine trying to get a computer to correctly identify panda bears among
images of all different kinds of animals. You need to show the computer thousands of images of
panda bears. They're all labeled as panda bears. Any incorrectly labeled picture, like the one
here that's just bear, will cause a problem. The next type of dirty data is having an inconsistent
field length.

You learned earlier that a field is a single piece of information from a row or column of a
spreadsheet. Field length is a tool for determining how many characters can be keyed into a
field. Assigning a certain length to the fields in your spreadsheet is a great way to avoid
errors. For instance, if you have a column for someone's birth year, you know the field length is
four because all years are four digits long. Some spreadsheet applications have a simple way to
specify field lengths and make sure users can only enter a certain number of characters into a
field. This is part of data validation. Data validation is a tool for checking the accuracy and
quality of data before adding or importing it.

Data-cleaning tools and techniques

46
. Now that you're familiar with some of the most common types of dirty data, it's time to clean
them up. As you've learned, clean data is essential to data integrity and reliable solutions and
decisions. The good news is that spreadsheets have all kinds of tools you can use to get
your data ready for analysis. The techniques for data cleaning will be different depending on the
specific data set you're working with. So we won't cover everything you might run into, but this
will give you a great starting point for fixing the types of dirty data analysts find most often. Think
of everything that's coming up as a teaser trailer of data cleaning tools.

I'm going to give you a basic overview of some common tools and techniques, and then we'll
practice them again later on. Here, we'll discuss how to remove unwanted data, clean up text to
remove extra spaces and blanks, fix typos, and make formatting consistent. However, before
removing unwanted data, it's always a good practice to make a copy of the data set. That way, if
you remove something that you end up needing in the future, you can easily access it and put it
back in the data set. Once that's done, then you can move on to getting rid of the duplicates
or data that isn't relevant to the problem you're trying to solve. Typically, duplicates appear
when you're combining datasets from more than one source or using data from multiple
departments within the same business. You've already learned a bit about duplicates, but let's
practice removing them once more now using this spreadsheet, which lists members of a
professional logistics association.

Duplicates can be a big problem for data analysts. So it's really important that you can find
and remove them before any analysis starts. Here's an example of what I'm talking about. Let's
say this association has duplicates of one person's $500 membership in its database. When the
data is summarized, the analyst would think there was $1,000 being paid by this member and
would make decisions based on that incorrect data. But in reality, this member only paid
$500. These problems can be fixed manually, but most spreadsheet applications also offer lots
of tools to help you find and remove duplicates.

Now, irrelevant data, which is data that doesn't fit the specific problem that you're trying to solve,
also needs to be removed. Going back to our association membership list example, let's say a
data analyst was working on a project that focused only on current members. They wouldn't
want to include information on people who are no longer members, or who never joined in the
first place. Removing irrelevant data takes a little more time and effort because you have to
figure out the difference between the data you need and the data you don't. But believe
me, making those decisions will save you a ton of effort down the road. The next step is

47
removing extra spaces and blanks. Extra spaces can cause unexpected results when you sort,
filter, or search through your data.

And because these characters are easy to miss, they can lead to unexpected and confusing
results. For example, if there's an extra space and in a member ID number, when you sort the
column from lowest to highest, this row will be out of place. To remove these unwanted spaces
or blank cells, you can delete them yourself. Or again, you can rely on your spreadsheets, which
offer lots of great functions for removing spaces or blanks automatically. The next data cleaning
step involves fixing misspellings, inconsistent capitalization, incorrect punctuation, and other
typos. These types of errors can lead to some big problems. Let's say you have a database of
emails that you use to keep in touch with your customers.

If some emails have misspellings, a period in the wrong place, or any other kind of typo, not only
do you run the risk of sending an email to the wrong people, you also run the risk of spamming
random people. Think about our association membership example again. Misspelling might
cause the data analyst to miscount the number of professional members if they sorted this
membership type and then counted the number of rows. Like the other problems you've come
across, you can also fix these problems manually. Or you can use spreadsheet tools, such as
spellcheck, autocorrect, and conditional formatting to make your life easier. There's also easy
ways to convert text to lowercase, uppercase, or proper case, which is one of the things we'll
check out again later. All right, we're getting there.

The next step is removing formatting. This is particularly important when you get data from lots
of different sources. Every database has its own formatting, which can cause the data to seem
inconsistent. Creating a clean and consistent visual appearance for your spreadsheets will
help make it a valuable tool for you and your team when making key decisions. Most
spreadsheet applications also have a "clear formats" tool, which is a great time saver. Cleaning
data is an essential step in increasing the quality of your data. Now you know lots of different
ways to do that.

Clean data from multiple sources


en

48
Interactive Transcript - Enable basic transcript
mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

- Welcome back. So far, you learned a lot about dirty data, and how to clean up the most
common errors in the dataset. Now we're going to take that a step further, and talk about
cleaning up multiple datasets. Cleaning data that comes from two or more sources is very
common for data analysts, but it does come with some interesting challenges. A good example
is a merger, which is an agreement that unites two organizations into a single new one. In the
logistics field, there's been lots of big changes recently, mostly because of the e-commerce
boom. With so many people shopping online, it makes sense that the companies responsible for
delivering these products to their homes are in the middle of a big shakeup.

When big things happen in an industry, it's common for two organizations to team up and
become stronger through a merger. Let's talk about how that'll affect our Logistics
Association. As a quick reminder, this spreadsheet lists association member ID numbers, first
and last names, addresses, how much each member pays in dues, when the membership
expires, and the membership types. Now, let's think about what would happen if the
International Logistics Association decided to get together with the Global Logistics
Association in order to help the members handle the incredible demands of e-commerce. First,
all the data from each organization would need to be combined using data merging. Data
merging is the process of combining two or more datasets into a single dataset. This presents a
unique challenge because when two totally different datasets are combined, the information is
almost guaranteed to be inconsistent and misaligned.

For example, the Global Logistics Association spreadsheet has a separate column for a
person's suite, apartment, or unit number, but the International Logistics Association combines
that information with their street address. This needs to be corrected to make the number of
address columns consistent. Next, check out how the Global Logistics Association uses
people's email addresses as their member ID, while the International Logistics Association uses

49
numbers. This is a big problem because people in a certain industry such as logistics, typically
join multiple professional associations. So there's a very good chance that these
datasets include membership information on the exact same person, just in different ways. It's
super important to remove those duplicates. Also, the Global Logistics Association has many
more member types than the other organization.

On top of that, it uses the term young professional instead of student associate, but both
describe members who are still in school, or just starting their careers. If you were merging
these two datasets, you need to work with your team to fix the fact that the two associations
describe memberships very differently. Okay, now you understand why the merging of
organizations also requires the merging of data, and that can be tricky, but there's lots of other
reasons why data analysts merge datasets. For example, in one of my past jobs, I merged a lot
of data from multiple sources to get insights about our customer's purchases. The kinds of
insights I gained helped me identify customer buying patterns. When merging datasets, I always
begin by asking myself some key questions to help me avoid redundancy, and to confirm that
the datasets are compatible. In data analytics, compatibility describes how well two or more
datasets are able to work together.

So the first question I would ask is, do I have all the data I need? To gather customer purchase
insights I wanted to make sure I had data on customers, their purchases, and where they
shopped. Next, I would ask, does the data I need exist within these datasets? As you learned
earlier in this program, this involves considering the entire dataset analytically. Looking through
the data before I start using it lets me get a feel for what it's all about, what the schema looks
like, if it's relevant to my customer purchase insights, and if it's clean data. That brings me to the
next question. Do the datasets need to be cleaned, or are they ready for me to use?

And because I'm working with more than one source, I'll also ask myself, are the datasets
cleaned to the same standard? For example, what fields are regularly repeated? How are
missing values handled? How recently was the data updated? Finding the answers to these
questions, and understanding if I need to fix any problems at the start of a project is a very
important step in data merging. In both the examples we explored here, a data analyst could
use either the spreadsheet tools, or SQL queries to clean up, merge, and prepare the datasets
for analysis. Depending on the tool you decide to use, the

50
Common data-cleaning pitfalls
In this reading, you will learn the importance of data cleaning and how to identify
common mistakes. Some of the errors you might come across while cleaning your data
could include:

51
Common mistakes to avoid

 Not checking for spelling errors: Misspellings can be as simple as typing or


input errors. Most of the time the wrong spelling or common grammatical errors can be
detected, but it gets harder with things like names or addresses. For example, if you are
working with a spreadsheet table of customer data, you might come across a customer
named “John” whose name has been input incorrectly as “Jon” in some places. The
spreadsheet’s spellcheck probably won’t flag this, so if you don’t double-check for
spelling errors and catch this, your analysis will have mistakes in it.
 Forgetting to document errors: Documenting your errors can be a big time
saver, as it helps you avoid those errors in the future by showing you how you resolved
them. For example, you might find an error in a formula in your spreadsheet. You
discover that some of the dates in one of your columns haven’t been formatted
correctly. If you make a note of this fix, you can reference it the next time your formula is
broken, and get a head start on troubleshooting. Documenting your errors also helps
you keep track of changes in your work, so that you can backtrack if a fix didn’t work.
 Not checking for misfielded values: A misfielded value happens when the
values are entered into the wrong field. These values might still be formatted correctly,
which makes them harder to catch if you aren’t careful. For example, you might have a
dataset with columns for cities and countries. These are the same type of data, so they
are easy to mix up. But if you were trying to find all of the instances of Spain in the
country column, and Spain had mistakenly been entered into the city column, you would
miss key data points. Making sure your data has been entered correctly is key to
accurate, complete analysis.
 Overlooking missing values: Missing values in your dataset can create errors
and give you inaccurate conclusions. For example, if you were trying to get the total
number of sales from the last three months, but a week of transactions were missing,
your calculations would be inaccurate. As a best practice, try to keep your data as
clean as possible by maintaining completeness and consistency.
 Only looking at a subset of the data: It is important to think about all of the
relevant data when you are cleaning. This helps make sure you understand the whole
story the data is telling, and that you are paying attention to all possible errors. For
example, if you are working with data about bird migration patterns from different
sources, but you only clean one source, you might not realize that some of the data is
being repeated. This will cause problems in your analysis later on. If you want to avoid
common errors like duplicates, each field of your data requires equal attention.

52
 Losing track of business objectives: When you are cleaning data, you might
make new and interesting discoveries about your dataset-- but you don’t want those
discoveries to distract you from the task at hand. For example, if you were working with
weather data to find the average number of rainy days in your city, you might notice
some interesting patterns about snowfall, too. That is really interesting, but it isn’t
related to the question you are trying to answer right now. Being curious is great! But try
not to let it distract you from the task at hand.
 Not fixing the source of the error: Fixing the error itself is important. But if that
error is actually part of a bigger problem, you need to find the source of the issue.
Otherwise, you will have to keep fixing that same error over and over again. For
example, imagine you have a team spreadsheet that tracks everyone’s progress. The
table keeps breaking because different people are entering different values. You can
keep fixing all of these problems one by one, or you can set up your table to streamline
data entry so everyone is on the same page. Addressing the source of the errors in your
data will save you a lot of time in the long run.
 Not analyzing the system prior to data cleaning: If we want to clean our
data and avoid future errors, we need to understand the root cause of your dirty data.
Imagine you are an auto mechanic. You would find the cause of the problem before you
started fixing the car, right? The same goes for data. First, you figure out where the
errors come from. Maybe it is from a data entry error, not setting up a spell check, lack
of formats, or from duplicates. Then, once you understand where bad data comes from,
you can control it and keep your data clean.
 Not backing up your data prior to data cleaning: It is always good to be
proactive and create your data backup before you start your data clean-up. If your
program crashes, or if your changes cause a problem in your dataset, you can always
go back to the saved version and restore it. The simple procedure of backing up your
data can save you hours of work-- and most importantly, a headache.
 Not accounting for data cleaning in your deadlines/process: All good
things take time, and that includes data cleaning. It is important to keep that in mind
when going through your process and looking at your deadlines. When you set aside
time for data cleaning, it helps you get a more accurate estimate for ETAs for
stakeholders, and can help you know when to request an adjusted ETA.

53
Key takeaways

Data cleaning is essential for accurate analysis and decision-making. Common


mistakes to avoid when cleaning data include spelling errors, misfielded values, missing
values, only looking at a subset of the data, losing track of business objectives, not
fixing the source of the error, not analyzing the system prior to data cleaning, not
backing up your data prior to data cleaning, and not accounting for data cleaning in your
deadlines/process. By avoiding these mistakes, you can ensure that your data is clean
and accurate, leading to better outcomes for your business.

Additional resources

Refer to these "top ten" lists for data cleaning in Microsoft Excel and Google Sheets to
help you avoid the most common mistakes:

 Top ten ways to clean your data: Review an orderly guide to data cleaning in Microsoft
Excel.
 10 Google Workspace tips to clean up data: Learn best practices for data cleaning in
Google Sheets.

Step-by-Step guide: Data-


cleaning features in
spreadsheets
This reading outlines the steps the instructor performs in the next video, Data-cleaning
features in spreadsheets. In the video, the instructor explains how to use menu options
in spreadsheets to fix errors.

Keep this step-by-step guide open as you watch the video. It can serve as a helpful
reference if you need additional context or clarification while following the video steps.
This is not a graded activity, but you can complete these steps to practice the skills
demonstrated in the video.

What you’ll need

54
If you’d like to follow along with the examples in this video, choose a spreadsheet tool.
Google Sheets or Excel are recommended.

To access the spreadsheet the instructor uses in this video, click the link to the template
to create a copy of the dataset. If you don’t have a Google account, download the data
directly from the attachments below.

Link to logistics data: International Logistics Association Memberships - Data for


Cleaning

Link to cosmetics data: Cosmetics Inc. - Data for Cleaning

OR

Example 1: Use conditional formatting to highlight blank cells

Conditional formatting is a spreadsheet tool that changes how cells appear when values
meet specific conditions.

1. Open the spreadsheet International Logistics Association Memberships - Data for


Cleaning.
2. Select the range of cells to which you’ll apply conditional formatting. In this example,
you’ll select columns A through L, except for columns F and H. To select all
columns except for F and H: a. Select cell A to highlight column A. b. Hold down the
SHIFT key and at the same time use your mouse to select cell E. This will highlight all
the columns between A and E. c. To select the remainder of the columns, hold down
the CONTROL (Windows) or COMMAND (Mac) key while you select cells G, I, J, K,
and L. d. Columns A through L in your spreadsheet should be highlighted except
Column F and Column H.

55
3. From the menu, select Format, then Conditional formatting. The columns
you’ve selected should turn a light shade of green, and a new Conditional format
rules tool will appear. Additionally, the Apply to range field should indicate the
cells you’ve selected.
4. Next, apply a condition to these cells to change the cell color if the cell is empty. In the
Format cells if drop-down, select Cell is empty.
5. Select the Formatting style field. Select a bright color from the drop-down to make
the blank cells stand out.
6. Select Done.

Example 2: Remove duplicates

Remove duplicates is a spreadsheet tool that automatically searches for and eliminates
duplicate entries from a spreadsheet. This is faster and easier than searching the data
by scrolling through it.

1. Create a copy of your dataset by right clicking the Association ABC membership
tab and selecting Duplicate. This is a good practice, as it protects against
accidentally deleting important data. Continue working in the new sheet, Copy of
Association ABC memberships.
2. In the menu, select Data, then Data cleanup, then Remove duplicates.
3. Check the box next to Data has header row.
4. Check the box next to Select All to inspect the entire spreadsheet.
5. Select Remove duplicates.

Example 3: Format dates consistently

Format dates to make all of the data in your spreadsheet consistent. This makes items
easier to find and manipulate.

1. Select column J (Membership valid through), which contains dates.


2. From the menu, select Format, then Number, then Date.

56
Example 4: Use split to separate data into columns

The split menu option is helpful when you have more than one piece of data in a cell
and you want to separate those pieces of data into different cells.

1. Select column L (Certification).


2. In the menu, select Data, then Split text to columns.
3. The delimiter (for example, a comma) will be automatically detected.
4. If needed, specify the separator manually in the dropdown that appears in your
spreadsheet.

Example 5: Use split to fix numbers stored as text

SPLIT is a spreadsheet function that divides text around a specified character and puts
each fragment into a new, separate cell.

1. Open the spreadsheet Cosmetics Inc. - Data for Cleaning.


2. Notice that cell F12 contains an error.
3. Select column E (Orders).
4. In the menu select Data, then select Split text to columns.
5. This removes the quotation marks from cell E12 so the spreadsheet recognizes the
data in the cell as a number. This resolves the error in cell F12.

Data-cleaning features in spreadsheets


en

Interactive Transcript - Enable basic transcript


mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

Hi again. As you learned earlier, there's a lot of different ways to clean up data. I've shown you
some examples of how you can clean data manually, such as searching for and

57
fixing misspellings or removing empty spaces and duplicates. We also learned that lots of
spreadsheet applications have tools that help simplify and speed up the data cleaning
process. There's a lot of great efficiency tools that data analysts use all the time, such as
conditional formatting, removing duplicates, formatting dates, fixing text strings and
substrings, and splitting text to columns. We'll explore those in more detail now. The first is
something called conditional formatting.

Conditional formatting is a spreadsheet tool that changes how cells appear when values meet
specific conditions. Likewise, it can let you know when a cell does not meet the conditions
you've set. Visual cues like this are very useful for data analysts, especially when we're working
in a large spreadsheet with lots of data. Making certain data points standout makes the
information easier to understand and analyze. For cleaning data, knowing when the data
doesn't follow the condition is very helpful. Let's return to the logistics association
spreadsheet to check out conditional formatting in action. We'll use conditional formatting to
highlight blank cells.

That way, we know where there's missing information so we can add it to the spreadsheet. To
do this, we'll start by selecting the range we want to search. For this example we're not focused
on address 3 and address 5. The fields will include all the columns in our spreadsheets,
except for F and H. Next, we'll go to Format and choose Conditional formatting. Great. Our
range is automatically indicated in the field.

The format rule will be to format cells if the cell is empty. Finally, we'll choose the formatting
style. I'm going to pick a shade of bright pink, so my blanks really stand out. Then click "Done,"
and the blank cells are instantly highlighted. The next spreadsheet tool removes duplicates. As
you've learned before, it's always smart to make a copy of the data set before removing
anything. Let's do that now.

Great, now we can continue. You might remember that our example spreadsheet has one
association member listed twice. To fix that, go to Data and select "Remove
duplicates." "Remove duplicates" is a tool that automatically searches for and eliminates
duplicate entries from a spreadsheet. Choose "Data has header row" because our
spreadsheet has a row at the very top that describes the contents of each column. Next, select
"All" because we want to inspect our entire spreadsheet. Finally, "Remove duplicates."

58
You'll notice the duplicate row was found and immediately removed. Another useful
spreadsheet tool enables you to make formats consistent. For example, some of the dates
in this spreadsheet are in a standard date format. This could be confusing if you wanted
to analyze when association members joined, how often they renewed their memberships, or
how long they've been with the association. To make all of our dates consistent, first select
column J, then go to "Format," select "Number," then "Date." Now all of our dates have a
consistent format. Before we go over the next tool, I want to explain what a text string is.

In data analytics, a text string is a group of characters within a cell, most often composed of
letters. An important characteristic of a text string is its length, which is the number of characters
in it. You'll learn more about that soon. For now, it's also useful to know that a substring is a
smaller subset of a text string. Now let's talk about Split. Split is a tool that divides a text string
around the specified character and puts each fragment into a new and separate cell. Split is
helpful when you have more than one piece of data in a cell and you want to separate them
out.

This might be a person's first and last name listed together, or it could be a cell that contains
someone's city, state, country, and zip code, but you actually want each of those in its own
column. Let's say this association wanted to analyze all of the different professional
certifications its members have earned. To do this, you want each certification separated out
into its own column. Right now, the certifications are separated by a comma. That's the
specified text separating each item, also called the delimiter. Let's get them separated. Highlight
the column, then select "Data," and "Split text to columns."

This spreadsheet application automatically knew that the comma was a delimiter and separated
each certification. But sometimes you might need to specify what the delimiter should be. You
can do that here. Split text to columns is also helpful for fixing instances of numbers stored as
text. Sometimes values in your spreadsheet will seem like numbers, but they're formatted as
text. This can happen when copying and pasting from one place to another or if the formatting's
wrong. For this example, let's check out our new spreadsheet from a cosmetics maker.

If a data analyst wanted to determine total profits, they could add up everything in column F. But
there's a problem; one of the cells has an error. If you check into it, you learn that the "707" in
this cell is text and can't be changed into a number. When the spreadsheet tries to multiply the
cost of the product by the number of units sold, it's unable to make the calculation. But if we

59
select the orders column and choose "Split text to columns," the error is resolved because now
it can be treated as a number. Coming up, you'll learn about a tool that does just the
opposite. CONCATENATE is a function that joins multiple text strings into a single string.

Spreadsheets are a very important part of data analytics. They save data analysts time and
effort and help us eliminate errors each and every day. Here, you've learned about some of the
most common tools that we use. But there's a lot more to come. Next, we'll learn even more
about data cleaning with spreadsheet tools. Bye for now!

Step-by-Step: Optimize the


data-cleaning process
This reading outlines steps the instructor performs in the following video, Optimize the
data-cleaning process. The video teaches some useful spreadsheet functions, which
can make your data-cleaning even more successful.

Keep this step-by-step guide open as you watch the video. It can serve as a helpful
reference if you need additional context or clarification while following the video steps.
This is not a graded activity, but you can complete these steps to practice the skills
demonstrated in the video.

What you’ll need

If you would like to access the spreadsheet the instructor uses in this video, click the
link to the dataset to create a copy. If you don’t have a Google account, you may
download the data directly from the attachments below.

Link to logistics data: International Logistics Association Memberships - Data for


Cleaning

Link to cosmetics data: Cosmetics Inc. - Data for Cleaning

OR

60
Example 1: The COUNTIF function

COUNTIF is a spreadsheet function that returns the number of cells within a range that
match a specified value.

Use COUNTIF to find numbers lower than 100

1. Open the International Logistics Association Memberships - Data for Cleaning dataset,
and scroll down to row 74.
1. Note: The dataset has 72 rows, and row 73 is left blank for separation.
2. In cell H74, enter Member Dues < 100 to label the calculation.
3. In cell I74, enter the formula =COUNTIF(I2:I72,"<100") to count how many
members in the cell range I2:I72 pay dues of less than $100. This formula returns a
value of 1, indicating one value is below $100.
4. In cell I55, change -$200 to $200. Cell I74 should now display the value 0.

Use COUNTIF to find numbers higher than 500

1. In cell H75, enter Member Dues > 500.


2. In cell I75, enter the formula =COUNTIF(I2:I72,">500") to count how many
members in cell range I2:I72 pay dues of greater than 500. This formula returns a
value of 1, indicating one value is above 500.
3. In cell I44, change $1,000 to $100. Cell I75 should now display the value 0.

Example 2: The LEN function

The LEN function is useful if you have a certain piece of information in your spreadsheet
that you know must contain a certain length.

1. Right click cell A.


2. Select + Insert one column right to create a new, empty column.

61
3. Select cell B1 and enter LEN to name the new column.
4. In cell B2, enter =LEN(A2). This function references the value of cell A2 and returns its
length, 6.
5. Double-click on the lower right corner of cell B2. This will copy the function through the
rest of the column. Each cell will show the length of the Member ID in that row.

Example 3: Use conditional formatting

Conditional formatting is a spreadsheet tool that changes how cells appear when values
meet specific conditions.

1. To highlight all of column B except for the header, select cell B. Then press CONTROL
(Windows) or COMMAND (MAC) and select cell B1.
2. Navigate to the Format menu, and choose Conditional Formatting.
3. Set the Format rules field to Is not equal to and enter 6 as the value.
4. Select Done.
5. Notice cell B36 is highlighted because its value is 7.

Example 4: The LEFT and RIGHT functions

LEFT is a function that returns a set number of characters from the left side of a text
string. RIGHT is a function that returns a set number of characters from the right side of
a text string.

The LEFT function

1. Use the Cosmetics Inc. - Data for Cleaning dataset.


2. Select cell H1, and enter Left.
3. In cell H2, enter =LEFT(A2, 5) to extract the first five characters from cell A2. This
function will show the substring 51993.
4. Select cell H2.
5. Select and hold the fill handle, the small circle in the corner of a selected cell, then drag
this formula down to populate the rest of this column.

62
The RIGHT function

1. Select cell I1, and enter Right.


2. In cell I2, enter =RIGHT(A2, 4) to extract the last four characters from cell A2. This
function will show the substring Masc.
3. Select cell I2.
4. Select and hold the fill handle and drag this formula down to populate the rest of this
column.

Example 5: The MID function

MID is a function that returns a segment from the middle of a text string.

1. Select cell J1, and enter Mid.


2. In cell J2, enter =MID(D2, 4, 2) to extract the two-letter state code that starts at
character four in cell D2.
3. Double-click the fill handle and to automatically populate the rest of this column.

Example 6: The CONCATENATE function

CONCATENATE is a spreadsheet function that joins together two or more text strings.

1. Select cell K1, and enter Concatenate.


2. In cell K2, enter =CONCATENATE(H2, I2) to combine the values from columns H and
I.
3. Double-click the fill handle and to automatically populate the rest of this column.

Example 7: TRIM function

TRIM is a function that removes leading, trailing, and repeated spaces in data.

1. Select cell L1, and enter Trim.


2. In cell L2, enter =TRIM(C2) to remove any leading, trailing, or repeated spaces.
3. Double-click the fill handle and to automatically populate the rest of this column.

Optimize the data-cleaning process


en

63
Interactive Transcript - Enable basic transcript
mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

Welcome back. You've learned about some very useful data- cleaning tools that are built right
into spreadsheet applications. Now we'll explore how functions can optimize your efforts to
ensure data integrity. As a reminder, a function is a set of instructions that performs a specific
calculation using the data in a spreadsheet. The first function we'll discuss is called
COUNTIF. COUNTIF is a function that returns the number of cells that match a specified
value. Basically, it counts the number of times a value appears in a range of cells.

Let's go back to our professional association spreadsheet. In this example, we want to make
sure the association membership prices are listed accurately. We'll use COUNTIF to check for
some common problems, like negative numbers or a value that's much less or much greater
than expected. To start, let's find the least expensive membership: $100 for student
associates. That'll be the lowest number that exists in this column. If any cell has a value
that's less than 100, COUNTIF will alert us. We'll add a few more rows at the bottom of our
spreadsheet, then beneath column H, type "member dueS less than $100." Next, type the
function in the cell next to it.

Every function has a certain syntax that needs to be followed for it to work. Syntax is a
predetermined structure that includes all required information and its proper placement. The
syntax of a COUNTIF function should be like this: Equals COUNTIF, open parenthesis, range,
comma, the specified value in quotation marks and a closed parenthesis. It will show up like
this. Where I2 through I72 is the range, and the value is less than 100. This tells the function to
go through column I, and return a count of all cells that contain a number less than 100. Turns
out there is one!

Scrolling through our data, we find that one piece of data was mistakenly keyed in as a negative
number. Let's fix that now. Now we'll use COUNTIF to search for any values that are more than
we would expect. The most expensive membership type is $500 for corporate members. Type

64
the function in the cell. This time it will appear like this: I2 through I72 is still the range, but the
value is greater than 500. There's one here too. Check it out.

This entry has an extra zero. It should be $100. The next function we'll discuss is called
LEN. LEN is a function that tells you the length of the text string by counting the number of
characters it contains. This is useful when cleaning data if you have a certain piece of
information in your spreadsheet that you know must contain a certain length. For example, this
association uses six-digit member identification codes. If we'd just imported this data and
wanted to be sure our codes are all the correct number of digits, we'd use LEN.

The syntax of LEN is equals LEN, open parenthesis, the range, and the close parenthesis. We'll
insert a new column after Member ID. Then type an equals sign and LEN. Add an open
parenthesis. The range is the first Member ID number in A2. Finish the function by closing the
parenthesis. It tells us that there are six characters in cell A2.

Let's continue the function through the entire column and find out if any results are not six. But
instead of manually going through our spreadsheet to search for these instances, we'll use
conditional formatting. We talked about conditional formatting earlier. It's a spreadsheet tool
that changes how cells appear when values meet specific conditions. Let's practice that
now. Select all of column B except for the header. Then go to Format and choose Conditional
formatting.

The format rule is to format cells if not equal to six. Click "Done." The cell with the seven inside
is highlighted. Now we're going to talk about LEFT and RIGHT. LEFT is a function that gives
you a set number of characters from the left side of a text string. RIGHT is a function that gives
you a set number of characters from the right side of a text string. As a quick reminder, a text
string is a group of characters within a cell, commonly composed of letters, numbers, or both.

To see these functions in action, let's go back to the spreadsheet from the cosmetics maker
from earlier. This spreadsheet contains product codes. Each has a five-digit numeric code and
then a four-character text identifier. But let's say we only want to work with one side or the
other. You can use LEFT or RIGHT to give you the specific set of characters or numbers you
need. We'll practice cleaning up our data using the LEFT function first. The syntax of LEFT is
equals LEFT, open parenthesis, the range, a comma, and a number of characters from the left
side of the text string we want.

65
Then, we finish it with a closed parenthesis. Here, our project requires just the five-digit numeric
codes. In a separate column, type equals LEFT, open parenthesis, then the range. Our range is
A2. Then, add a comma, and then number 5 for our five- digit product code. Finally, finish the
function with a closed parenthesis. Our function should show up like this.

Press "Enter." And now, we have a substring, which is the number part of the product code
only. Click and drag this function through the entire column to separate out the rest of the
product codes by number only. Now, let's say our project only needs the four-character text
identifier. For that, we'll use the RIGHT function, and the next column will begin the
function. The syntax is equals RIGHT, open parenthesis, the range, a comma and the number
of characters we want. Then, we finish with a closed parenthesis.

Let's key that in now. Equals right, open parenthesis, and the range is still A2. Add a
comma. This time, we'll tell it that we want the first four characters from the right. Close up the
parenthesis and press "Enter." Then, drag the function throughout the entire column. Now, we
can analyze the product in our spreadsheet based on either substring.

The five-digit numeric code or the four character text identifier. Hopefully, that makes it clear
how you can use LEFT and RIGHT to extract substrings from the left and right sides of a
string. Now, let's learn how you can extract something in between. Here's where we'll use
something called MID. MID is a function that gives you a segment from the middle of a text
string. This cosmetics company lists all of its clients using a client code. It's composed of the
first three letters of the city where the client is located, its state abbreviation, and then a three-
digit identifier.

But let's say a data analyst needs to work with just the states in the middle. The syntax for MID
is equals MID, open parenthesis, the range, then a comma. When using MID, you always need
to supply a reference point. In other words, you need to set where the function should
start. After that, place another comma, and how many middle characters you want. In this case,
our range is D2. Let's start the function in a new column.

Type equals MID, open parenthesis, D2. Then the first three characters represent a city
name, so that means the starting point is the fourth. Add a comma and four. We also need to tell
the function how many middle characters we want. Add one more comma, and two, because
the state abbreviations are two characters long. Press "Enter" and bam, we just get the state
abbreviation. Continue the MID function through the rest of the column.

66
We've learned about a few functions that help separate out specific text strings. But what if we
want to combine them instead? For that, we'll use CONCATENATE, which is a function that
joins together two or more text strings. The syntax is equals CONCATENATE, then an open
parenthesis inside indicates each text string you want to join, separated by commas. Then finish
the function with a closed parenthesis. Just for practice, let's say we needed to rejoin the left
and right text strings back into complete product codes. In a new column, let's begin our
function.

Type equals CONCATENATE, then an open parenthesis. The first text string we want to join
is in H2. Then add a comma. The second part is in I2. Add a closed parenthesis and press
"Enter". Drag it down through the entire column, and just like that, all of our product codes are
back together. The last function we'll learn about here is TRIM.

TRIM is a function that removes leading, trailing, and repeated spaces in data. Sometimes when
you import data, your cells have extra spaces, which can get in the way of your analysis. For
example, if this cosmetics maker wanted to look up a specific client name, it won't show up in
the search if it has extra spaces. You can use TRIM to fix that problem. The syntax for TRIM is
equals TRIM, open parenthesis, your range, and closed parenthesis. In a separate column, type
equals TRIM and an open parenthesis. The range is C2, as you want to check out the client
names.

Close the parenthesis and press "Enter". Finally, continue the function down the column. TRIM
fixed the extra spaces. Now we know some very useful functions that can make your data
cleaning even more successful. This was a lot of information. As always, feel free to go back
and review the video and then practice on your own. We'll continue building on these tools
soon, and you'll also have a chance to practice.

Workflow automation
In this reading, you will learn about workflow automation and how it can help you work
faster and more efficiently. Basically, workflow automation is the process of automating
parts of your work. That could mean creating an event trigger that sends a notification
when a system is updated. Or it could mean automating parts of the data cleaning
process. As you can probably imagine, automating different parts of your work can save

67
you tons of time, increase productivity, and give you more bandwidth to focus on other
important aspects of the job.

What can be automated?

Automation sounds amazing, doesn’t it? But as convenient as it is, there are still some
parts of the job that can’t be automated. Let's take a look at some things we can
automate and some things that we can’t.

Task Can it be Why?


automate
d?

Communicating with No Communication is key to understanding the needs of you


your team and team and stakeholders as you complete the tasks you ar
stakeholders working on. There is no replacement for person-to-person
communications.

68
Task Can it be Why?
automate
d?

Presenting your No Presenting your data is a big part of your job as a data
findings analyst. Making data accessible and understandable to
stakeholders and creating data visualizations can’t be
automated for the same reasons that communications ca
be automated.

Preparing and Partially Some tasks in data preparation and cleaning can be
cleaning data automated by setting up specific processes, like using a
programming script to automatically detect missing value

Data exploration Partially Sometimes the best way to understand data is to see it.
Luckily, there are plenty of tools available that can help
automate the process of visualizing data. These tools can
speed up the process of visualizing and understanding th
data, but the exploration itself still needs to be done by a
data analyst.

Modeling the data Yes Data modeling is a difficult process that involves lots of
different factors; luckily there are tools that can complete
automate the different stages.

More about automating data cleaning

One of the most important ways you can streamline your data cleaning is to clean data
where it lives. This will benefit your whole team, and it also means you don’t have to
repeat the process over and over. For example, you could create a programming script
that counted the number of words in each spreadsheet file stored in a specific folder.
Using tools that can be used where your data is stored means that you don’t have to
repeat your cleaning steps, saving you and your team time and energy.

69
More resources

There are a lot of tools out there that can help automate your processes, and those
tools are improving all the time. Here are a few articles or blogs you can check out if you
want to learn more about workflow automation and the different tools out there for you
to use:

 Towards Data Science’s Automating Scientific Data Analysis


 MIT News’ Automating Big-Data Analysis
 TechnologyAdvice’s 10 of the Best Options for Workflow Automation
Software

Key takeaways

As a data analyst, automation can save you a lot of time and energy, and free you up to
focus more on other parts of your project. The more analysis you do, the more ways you
will find to make your processes simpler and more streamlined.

Step-by-Step: Different data


perspectives
This reading outlines the steps the instructor performs in the next video, Different data
perspectives. The video teaches you different methods data analysts use to view data
differently and how looking at different views leads to more efficient and effective data
cleaning.

Keep this step-by-step guide open as you watch the video. It can serve as a helpful
reference if you need additional context or clarification while following the video steps.
This is not a graded activity, but you can complete these steps to practice the skills
demonstrated in the video.

What you’ll need

If you’d like to follow along with the examples in this video, choose a spreadsheet tool.
Google Sheets or Excel are recommended.
70
To access the spreadsheet the instructor uses in this video, click the link to the template
to create a copy of the dataset. If you don’t have a Google account, download the data
directly from the attachments below.

Link to template: Cosmetics, Inc.

OR

Example 1: Pivot tables

A pivot table is a data summarization tool. It can be used in data processing and in data
cleaning, for which pivot tables offer a quick, clutter-free view of your data. Pivot tables
help sort, reorganize, group, count, total, or average data in a dataset.

1. In the Cosmetics Inc. spreadsheet, select the data you'll include. In this case, select all
of the data in Sheet 1 of the spreadsheet by selecting cell A1 then dragging your cursor
to cell F31.
2. Select Insert, then Pivot Table. Choose New sheet and Create. Google Sheets
creates a new sheet where you can define the pivot table.
3. Use the Pivot table editor to add specific data to your pivot table.

a. In the Pivot table editor panel, next to Rows, select Add.

b. From the columns list, select Total.

c. Below Rows, from the Order dropdown list, select Descending to put the most
profitable items at the top.

d. Next to Rows, select Add.

e. From the column list, select Products.

f. Notice that the top two most ordered products are 15143Exfo and 32729Masc.
The rest of the orders total less than $10,000.

71
Example 2: VLOOKUP

VLOOKUP is a spreadsheet function that vertically searches for a certain value in a


column to return a corresponding piece of information. It's rare for all of the data an
analyst will need to be in the same place. Usually, you'll have to search across multiple
sheets or even different databases. VLOOKUP helps bring the information together.

In the previous example, you found the product codes of the most ordered products.
Now, you’ll use VLOOKUP to find the names of these products.

1. Select the Sheet 1 tab to navigate to Sheet 1 of the spreadsheet.


2. Select cell H2.
3. Enter =VLOOKUP(A2, 'Sheet 2'!A1:B31, 2, false)
1. Note: This references information in another sheet. Make sure you have Sheet 2 in
your workbook.
2. This formula will take the value in cell A2 of Sheet 1 and check for that value in
Sheet 2 among the cells from A1:B31 in the 2nd column (which corresponds with
the 2 in the formula). Because the formula includes “false,” it will search only for an
exact match. It will then output the value of column B in Sheet 2 as the result.
4. Press Enter to input the formula. The result is LashX Mascara.
5. Next, select the cell and drag the fill handle in the lower-right corner down to
populate the other cells in the sheet with the formula.
6. To find the names of the two most profitable products you identified in the previous
example use the find and replace tool.
1. Select Edit > Find and Replace.
2. In the Find text box, enter the product code for the most profitable product,
15143Exfo.
3. Select This sheet from the dropdown list next to Search.
4. Select Find to find any cells in this sheet that contain this product code.
5. Notice that cell A31 is a match. This means the VLOOKUP search you ran in Column
H31 contains the name of the most profitable product: SoSoft Exfoliator.
6. Repeat steps a-d with the product code 32729Masc to find the product name of the
second most profitable product. Cell A27 contains 32729Masc, so the product name
is Darkest Lashes Mascara.

72
Example 3: Plotting

The plotting tool allows analysts to quickly create a graph, chart, table, or other visual
from data. Plotting is useful for identifying skewed data or outliers.

1. In Sheet 1 of the Cosmetics, Inc. spreadsheet, select column B, which contains


the prices.
2. Select Insert > Chart.
1. If the chart created is not a column chart, select Column chart from the dropdown
menu under Chart type in the Chart editor.
2. Select and drag the chart to the right so you can view the data in the sheet.
3. Check for obvious outliers and fix them in the spreadsheet. For example, you might
notice that an item in the middle of the chart has an extremely low price of $0.73. The
decimal point is in the wrong place. In cell B14, correct this price to $7.30, and notice
that Google Sheets automatically updates the chart.

Different data perspectives


en

Interactive Transcript - Enable basic transcript


mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

Hi, let's get into it. Motivational speaker Wayne Dyer once said, "If you change the way you look
at things, the things you look at change." This is so true in data analytics. No two analytics
projects are ever exactly the same. So it only makes sense that different projects require us to
focus on different information differently. In this video, we'll explore different methods that data
analysts use to look at data differently and how that leads to more efficient and effective data
cleaning. Some of these methods include sorting and filtering, pivot tables, a function called
VLOOKUP, and plotting to find outliers.

73
Let's start with sorting and filtering. As you learned earlier, sorting and filtering data helps data
analysts customize and organize the information the way they need for a particular project. But
these tools are also very useful for data cleaning. You might remember that sorting involves
arranging data into a meaningful order to make it easier to understand, analyze, and
visualize. For data cleaning, you can use sorting to put things in alphabetical or numerical
order, so you can easily find a piece of data. Sorting can also bring duplicate entries closer
together for faster identification. Filters, on the other hand, are very useful in data cleaning when
you want to find a particular piece of information.

You learned earlier that filtering means showing only the data that meets a specific criteria while
hiding the rest. This lets you view only the information you need. When cleaning data, you might
use a filter to only find values above a certain number, or just even or odd values. Again, this
helps you find what you need quickly and separates out the information you want from the
rest. That way you can be more efficient when cleaning your data. Another way to change the
way you view data is by using pivot tables. You've learned that a pivot table is a data
summarization tool that is used in data processing.

Pivot tables sort, reorganize, group, count, total or average data stored in the database. In data
cleaning, pivot tables are used to give you a quick, clutter- free view of your data. You can
choose to look at the specific parts of the data set that you need to get a visual in the form of a
pivot table. Let's create one now using our cosmetic makers spreadsheet again. To start, select
the data we want to use. Here, we'll choose the entire spreadsheet. Select "Data" and then
"Pivot table."

Choose "New sheet" and "Create." Let's say we're working on a project that requires us to look
at only the most profitable products. Items that earn the cosmetics maker at least $10,000 in
orders. So the row we'll include is "Total" for total profits. We'll sort in descending order to put
the most profitable items at the top. And we'll show totals. Next, we'll add another row for
products so that we know what those numbers are about.

We can clearly determine tha the most profitable products have the product codes 15143 E-X-F-
O and 32729 M-A-S-C. We can ignore the rest for this particular project because they fall below
$10,000 in orders. Now, we might be able to use context clues to assume we're talking about
exfoliants and mascaras. But we don't know which ones, or if that assumption is even

74
correct. So we need to confirm what the product codes correspond to. And this brings us to the
next tool. It's called VLOOKUP.

VLOOKUP stands for vertical lookup. It's a function that searches for a certain value in a column
to return a corresponding piece of information. When data analysts look up information for a
project, it's rare for all of the data they need to be in the same place. Usually, you'll have to
search across multiple sheets or even different databases. The syntax of the VLOOKUP is
equals VLOOKUP, open parenthesis, then the data you want to look up. Next is a comma and
where you want to look for that data. In our example, this will be the name of a spreadsheet
followed by an exclamation point.

The exclamation point indicates that we're referencing a cell in a different sheet from the one
we're currently working in. Again, that's very common in data analytics. Okay, next is the range
in the place where you're looking for data, indicated using the first and last cell separated by a
colon. After one more comma is the column in the range containing the value to return. Next,
another comma and the word "false," which means that an exact match is what we're looking
for. Finally, complete your function by closing the parentheses. To put it simply, VLOOKUP
searches for the value in the first argument in the leftmost column of the specified location.

Then the value of the third argument tells VLOOKUP to return the value in the same row from
the specified column. The "false" tells VLOOKUP that we want an exact match. Soon you'll learn
the difference between exact and approximate matches. But for now, just know that V lookup
takes the value in one cell and searches for a match in another place. Let's begin. We'll type
equals VLOOKUP. Then add the data we are looking for, which is the product data.

The dollar sign makes sure that the corresponding part of the reference remains
unchanged. You can lock just the column, just the row, or both at the same time. Next, we'll tell
it to look at Sheet 2, in both columns We added 2 to represent the second column. The last
term, "false," says we wanted an exact match. With this information, we can now analyze the
data for only the most profitable products. Going back to the two most profitable products, we
can search for 15143 E-X-F-O And 32729 M-A-S-C. Go to Edit and then Find. Type in the
product codes and search for them.

Now we can learn which products we'll be using for this particular project. The final tool we'll talk
about is something called plotting. When you plot data, you put it in a graph chart, table, or other
visual to help you quickly find what it looks like. Plotting is very useful when trying to identify any

75
skewed data or outliers. For example, if we want to make sure the price of each product is
correct, we could create a chart. This would give us a visual aid that helps us quickly figure out if
anything looks like an error. So let's select the column with our prices.

Then we'll go to Insert and choose Chart. Pick a column chart as the type. One of these prices
looks extremely low. If we look into it, we discover that this item has a decimal point in the
wrong place. It should be $7.30, not 73 cents. That would have a big impact on our total
profits. So it's a good thing we caught that during data cleaning.

Looking at data in new and creative ways helps data analysts identify all kinds of dirty
data. Coming up, you'll continue practicing these new concepts so you can get more
comfortable with them. You'll also learn additional strategies for ensuring your data is clean, and
we'll provide you with effective insights. Great work so far.

Step-by-Step: Even more data-


cleaning techniques
This reading outlines the steps the instructor performs in the next video, Even more
data-cleaning techniques. This video teaches you different methods data analysts use
in data mapping. Data mapping is the process of matching fields from one database to
another. It’s critical to the success of data migration, data integration, and many other
data-management activities. This video contains one activity for you to practice.

Keep this step-by-step guide open as you watch the video. It can serve as a helpful
reference if you need additional context or clarification while following the video steps.
This is not a graded activity, but you can complete these steps to practice the skill
demonstrated in the video.

What you’ll need

If you’d like to follow along with the example in this video, choose a spreadsheet tool,
such as Google Sheets or Excel.

76
To access the spreadsheet the instructor uses in this video, click the link to the template
to create a copy of the dataset. If you don’t have a Google account, download the data
directly from the attachments below.

Link to templates:

International Logistics Association memberships

Global Logistics Association

Logistics Association Merger

Downloads:

Example: CONCATENATE

CONCATENATE is a function that joins together two or more text strings. In the video,
you’ll learn how to use CONCATENATE to clean data after two datasets have been
combined.

1. Open the dataset spreadsheet titled Global Logistics Association. When


prompted, select USE TEMPLATE.
2. Insert a new column to the right of column E. Label it New Address in cell F1.
3. In the second row of the new column (cell F2), enter =CONCATENATE (D2,E2) and
press Enter.
1. You will notice that some results need a space between the street address and the unit
or suite number, such as: 25 Dyas RdSte. 101.

77
2. You could manually clean the data later to add a space between Rd and Ste., but
CONCATENATE can actually do it for you.
3. The CONCATENATE formula can help you format the data as it is merged by entering an
additional string to insert a space between Rd and Ste.
4. Enter =CONCATENATE(D2, " ", E2) and you will have an address that is formatted
like this: 25 Dyas Rd Ste. 101. Much better!
4. Ensure the new data in the cell accurately reflects the merging of the two previous
columns.
5. Select cell F2 and drag down to apply the formula to all rows in the column.

Even more data-cleaning techniques


en

Interactive Transcript - Enable basic transcript


mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

Hello. So far you've learned about a lot of different tools and functions that analysts use to clean
up data for analysis. Now we'll take a step back and talk about some of the really big picture
aspects of clean data. Knowing how to fix specific problems, either manually with spreadsheet
tools, or with functions, is extremely valuable. But it's also important to think about how your
data has moved between systems and how it's evolved along it's journey to your data analysis
project. To do this, data analysts use something called data mapping. Data mapping is the
process of matching fields from one database to another.

This is very important to the success of data migration, data integration, and lots of other data
management activities. As you learned earlier, different systems store data in different
ways. For example, the state field in one spreadsheet might show Maryland spelled out. But
another spreadsheet might store it as MD. Data mapping helps us note these kinds of
differences so we know when data is moved and combined it will be compatible. As a quick
reminder, compatibility describes how well two or more data sets are able to work together. The
first step to data mapping is identifying what data needs to be moved.

78
This includes the tables and the fields within them. We also need to define the desired format
for the data once it reaches its destination. To figure out how this works let's go back to the
merger between our two logistics associations. Starting with the first data field, we'll identified
that we need to move both sets of member IDs. To define the desired format, we'll choose
whether to use numbers like this spreadsheet, or email addresses like the other
spreadsheet. Next comes mapping the data. Depending on the schema and number of primary
and foreign keys in a data source, data mapping can be simple or very complex.

As a reminder, a schema is a way of describing how something is organized. A primary key


references a column in which each value is unique and a foreign key is a field within a table that
is a primary key in another table. For more challenging projects there's all kinds of data mapping
software programs you can use. These data mapping tools will analyze field by field how to
move data from one place to another then they automatically clean, match, inspect, and validate
the data. They also create consistent naming conventions, ensuring compatibility when the
data is transferred from one source to another. When selecting a software program to map your
data, you want to be sure that it supports the file types you're working with, such as Excel, SQL,
Tableau, and others. Later on, you'll learn more about selecting the right tool for a particular
task.

For now, let's practice mapping data manually. First, we need to determine the content of each
section to make sure the data ends up in the right place. For example, the data on when
memberships expire would be consolidated into a single column. This step makes sure that
each piece of information ends up in the most appropriate place in the merged data
source. Now, you might remember that some of the data was inconsistent between the two
organizations, like the fact that one uses a separate column for suite apartment or unit number
but the other doesn't. This brings us to the next step, transforming the data into a consistent
format. This is a great time to use concatenate.

As you learned before, concatenate is a function that joins together two or more text
strings, which is what we did earlier with our cosmetics company example. We'll insert a new
column and then type equals concatenate, then the two text strings we want to make one. Drag
that through the entire column. Now we have the consistency in the new merged association
lists of member addresses. Now that everything's compatible, it's time to transfer the data to its
destination. There's a lot of different ways to move data from one place to another, including
querying, import wizards, and even simple drag and drop. Here's our merged spreadsheet.

79
It looks good, but we still want to make sure everything was transferred properly. We'll go into
the testing phase of data mapping. For this, you inspect a sample piece of data to confirm that
it's clean and properly formatted. It's also a smart practice to do spot checks on things such as
the number of nulls. For the test, you can use a lot of the data cleaning tools we discussed
previously, such as data validation, conditional formatting, COUNTIF, sorting, and
filtering. Finally, once you've determined that the data is clean and compatible, you can start
using it for analysis. Data mapping is so important because even one mistake when merging
data can ripple throughout an organization, causing the same error to appear again and again.

This leads to poor results. On the other hand, data mapping can save the day by giving you a
clear road map you can follow to make sure your data arrives safely at it's destination. That's
why you learn how to do it.

Working with .csv files


In an earlier course in this certificate program, you worked with .csv files. Data analysts
use .csv files often, so throughout this course you will continue to use .csv files to
transfer data into data analysis programs for further analysis and visualization. .csv files
are plain text files with an organized table structure that includes rows and columns.
The values in each row are separated by commas. This table structure makes them
easy to understand, edit, manipulate, and use for data analysis.

A major advantage of .csv files is their widespread compatibility. They can be imported
and exported by a vast range of data analysis tools and software programs.

Download .csv files

To use .csv files and upload them to data analysis programs you will first need to
download them to your local device. Downloading a .csv file from a website can vary
depending on your operating system or internet browser. Here are some ways you can
download a .csv file:

 Click the download link or .csv attachment: Locate the link for the .csv file
or attachment on the website. Click on it, and the download process will start.

80
 Right-click and Save: Right-click on the data table or element containing
the .csv data. Choose Save as… or a similar option. Name the file and make sure the
extension on the file is “.csv”.
 Force download: You can use the Alt key on your keyboard while clicking the link.
This will trigger the download, and you will be able to find the .csv file in your
Downloads folder.

Note: When using the Chrome browser or ChromeOS, .csv files may open in a new
tab instead of downloading to your machine. If this happens, follow these instructions:

 Select File from the menu bar, then select Save as Google Sheets. This will open
the .csv file as a Google Sheet.
 Select File from the menu bar, then select Download from the dropdown menu, then
select Comma Separated Values (.csv).

Upload .csv files

You will often need to upload .csv files during the data analysis process. Here is how
you do this:

 Locate the upload option: Each data analysis platform will have a designated
button, menu option, or drag-and-drop area labeled Upload or Import. This is where
you will upload your .csv file.
 Choose your .csv file: Click Upload or Import on the platform you are using to
open your file explorer. Select your .csv file. If you just downloaded a .csv file from the
web, it will be located in your computer’s Downloads folder.
 Initiate the upload: Once you've selected your .csv file, click Upload or Import.
The platform may display a progress bar or message indicating that the upload is
complete.

Note: Some platforms have restrictions on the file size or format of .csv files. Make
sure your .csv files adhere to these requirements before uploading.

Develop your approach to


cleaning data
81
As you continue on your data journey, you’re likely discovering that data is often messy
—and you can expect raw, primary data to be imperfect. In this reading, you’ll consider
how to develop your personal approach to cleaning data. You will explore the idea of a
cleaning checklist, which you can use to guide your cleaning process. Then, you’ll
define your preferred methods for cleaning data. By the time you complete this reading,
you’ll have a better understanding of how to methodically approach the data cleaning
process. This will save you time when cleaning data and help you ensure that your data
is clean and usable.

Consider your approach to cleaning data

Data cleaning usually requires a lot of time, energy, and attention. But there are two
steps you can take before you begin to help streamline your process: creating a
cleaning checklist and deciding on your preferred methods. This will help ensure that
you know exactly how you want to approach data cleaning and what you need to do to
be confident in the integrity of your data.

Your cleaning checklist

Start developing your personal approach to cleaning data by creating a checklist to help
you identify problems in your data efficiently and identify the scale and scope of your
dataset. Think of this checklist as your default “what to search for” list.

Here are some examples of common data cleaning tasks you could include in your
checklist:

 Determine the size of the dataset: Large datasets may have more data quality
issues and take longer to process. This may impact your choice of data cleaning
techniques and how much time to allocate to the project.
 Determine the number of categories or labels: By understanding the
number and nature of categories and labels in a dataset, you can better understand the
diversity of the dataset. This understanding also helps inform data merging and
migration strategies.
 Identify missing data: Recognizing missing data helps you understand data
quality so you can take appropriate steps to remediate the problem. Data integrity is
important for accurate and unbiased analysis.

82
 Identify unformatted data: Identifying improperly or inconsistently formatted
data helps analysts ensure data uniformity. This is essential for accurate analysis and
visualization.
 Explore the different data types: Understanding the types of data in your
dataset (for instance, numerical, categorical, text) helps you select appropriate cleaning
methods and apply relevant data analysis techniques.

There might be other data cleaning tasks you’ve been learning about that you also want
to prioritize in your checklist. Your checklist is an opportunity for you to define exactly
what you want to remember about cleaning your data; feel free to make it your own.

Your preferred cleaning methods

In addition to creating a checklist, identify which actions or tools you prefer using when
cleaning data. You’ll use these tools and techniques with each new dataset—or
whenever you encounter issues in a dataset—so this list should be compatible with your
checklist.

For example, suppose you have a large dataset with missing data. You’ll want to know
how to check for missing data in larger datasets, and how you plan to handle any
missing data, before you start cleaning. Outlining your preferred methods can save you
lots of time and energy.

Glossary terms from module 2


Terms and definitions for Course 4, Module 2

Clean data: Data that is complete, correct, and relevant to the problem being solved

Compatibility: How well two or more datasets are able to work together

CONCATENATE: A spreadsheet function that joins together two or more text strings

Conditional formatting: A spreadsheet tool that changes how cells appear when
values meet specific conditions

Data engineer: A professional who transforms data into a useful format for analysis
and gives it a reliable infrastructure

83
Data mapping: The process of matching fields from one data source to another

Data merging: The process of combining two or more datasets into a single dataset

Data validation: A tool for checking the accuracy and quality of data

Data warehousing specialist: A professional who develops processes and


procedures to effectively store and organize data

Delimiter: A character that indicates the beginning or end of a data item

Dirty data: Data that is incomplete, incorrect, or irrelevant to the problem to be


solved

Duplicate data: Any record that inadvertently shares data with another record

Field length: A tool for determining how many characters can be keyed into a
spreadsheet field

Incomplete data: Data that is missing important fields

Inconsistent data: Data that uses different formats to represent the same thing

Incorrect/inaccurate data: Data that is complete but inaccurate

LEFT: A function that returns a set number of characters from the left side of a text
string

LEN: A function that returns the length of a text string by counting the number of
characters it contains

Length: The number of characters in a text string

Merger: An agreement that unites two organizations into a single new one

MID: A function that returns a segment from the middle of a text string

Null: An indication that a value does not exist in a dataset

Outdated data: Any data that has been superseded by newer and more accurate
information

84
Remove duplicates: A spreadsheet tool that automatically searches for and
eliminates duplicate entries from a spreadsheet

Split: A function that divides text around a specified character and puts each fragment
into a new, separate cell

Substring: A smaller subset of a text string

Text string: A group of characters within a cell, most often composed of letters

TRIM: A function that removes leading, trailing, and repeated spaces in data

Unique: A value that can’t have a duplicate

MODULE 03

Sally: For the love of SQL


Advertising agencies get money from their clients to advertise their brand. These agencies use
our products, use certain Google platforms, advertising platforms, and I help them with how to
best use those platforms, different strategies they can use to be best in class. A lot of the folks
at the advertising agencies have reports that they have to send out to their clients. These
reports take a lot of time to create and visualize, and so what I do is I help the practitioners and
the analytics teams use a particular product that enables them to create those reports much
faster and much easier. If you're going to start off as a data analyst, it opens tons of
doors because everybody is tracking data, is using data, needs to use data, regardless of
industry. Anywhere from health care, to advertising, to e-commerce, to entertainment, anything
and everything, everybody uses data, so everybody needs you as a data analyst. SQL makes
our lives easier when we're analyzing lots of different data.

It's only somewhat recently that the SQL programs that we use now can give us instant results
for analyzing millions or billions of data. Years ago, maybe about five years ago or so, even
though we could still analyze those millions of rows, we would end up having to wait fifteen
minutes, thirty minutes for the queries to run. But now it's instantaneous, and so that's really
exciting, and we can do so much more with that power. SQL has helped a lot in my career
because it's one of those fundamental things you have to know as a data analyst. Back in the
day, not everyone knew SQL, so knowing SQL was definitely a competitive

85
advantage. Nowadays, I would say more people, maybe most people know it. It's a core skill
and highly sought after by everybody.

So, knowing SQL, becoming a data analyst makes you quite popular from recruiters, so I think
that's really fun. I taught myself SQL, so my knowledge about SQL is something I hold near and
dear, close to my heart since it's something that almost I've made for myself, and I feel so much
satisfaction from it. So that's why I really like SQL. One of the fun things about SQL and another
reason why I really enjoy using it is because when you type something in that query, and you
just hit Control, Shift, Enter, or once you've run the query, you get the results almost instantly,
depending on the platform you use. But it's fascinating to see if you think conceptually how
much analysis the computer is doing for you based on that little bit of command code or a little
bit of code you wrote, and it's just so powerful if you think about what's happening behind the
scenes. So I think that's fun to look at. We live in a world of big data, and it keeps getting
bigger.

The computing power is also increasing exponentially. With all the data that we can track, the
more and more we can track that data, the more and more we need data analysts. Our career
prospects are basically skyrocketing. I'm Sally, I'm a measurement and analytical lead at
Google.

Understand SQL capabilities


en

Interactive Transcript - Enable basic transcript


mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

Hello, again. So before we go over all the ways data analysts use SQL to clean data, I want to
formally introduce you to SQL. We've talked about SQL a lot already. You've seen some
databases and some basic functions in SQL, and you've even seen how SQL can be used to
process data. But now let's actually define SQL. SQL is a structured query language that

86
analysts use to work with databases. Data analysts usually use SQL to deal with large
datasets because it can handle huge amounts of data.

And I mean trillions of rows. That's a lot of rows to wrap your head around. So let me give you
an idea about how much data that really is. Imagine a data set that contains the names of all 8
billion people in the world. It would take the average person 101 years to read all 8 billion
names. SQL can process this in seconds. Personally, I think that's pretty cool.

Other tools like spreadsheets might take a really long time to process that much data, which is
one of the main reasons data analysts choose to use SQL, when dealing with big datasets. Let
me give you a short history on SQL. Development on SQL actually began in the early 70s. In
1970, Edgar [Link] developed the theory about relational databases. You might remember
learning about relational databases a while back. This is a database that contains a series of
tables that can be connected to form relationships. At the time IBM was using a relational
database management system called System R.

Well, IBM computer scientists were trying to figure out a way to manipulate and retrieve data
from IBM System R. Their first query language was hard to use. So they quickly moved on to
the next version, SQL. In 1979, after extensive testing SQL, now just spelled S-Q-L, was
released publicly. By 1986, SQL had become the standard language for relational database
communication, and it still is. This is another reason why data analysts choose SQL. It's a well-
known standard within the community.

The first time I used SQL to pull data from a real database was for my first job as a data
analyst. I didn't have any background knowledge about SQL before that. I only found out about
it because it was a requirement for that job. The recruiter for that position gave me a week to
learn it. So I went online and researched it and ended up teaching myself SQL. They actually
gave me a written test as part of the job application process. I had to write SQL queries and
functions on a whiteboard.

But I've been using SQL ever since. And I really like it. And just like I learned SQL on my own, I
wanted to remind you that you can figure things out yourself too. There's tons of great online
resources for learning. So don't let one job requirement stand in your way without doing some
research first. Now that we know a little more about why analysts choose to work with SQL
when they're handling a lot of data and a little bit about the history of SQL, we'll move on and

87
learn some practical applications for it. Coming up next, we'll check out some of the tools we
learned in spreadsheets and figure out if any of those apply to working in SQL.

How a junior data analyst uses


SQL
In this reading, you will learn more about how to decide when to use SQL, or Structured
Query Language. As a data analyst, you will be tasked with handling a lot of data, and
SQL is one of the tools that can help make your work a lot easier. SQL is the primary
way data analysts extract data from databases. As a data analyst, you will work with
databases all the time, which is why SQL is such a key skill. Let’s follow along as a
junior data analyst uses SQL to solve a business task.

The business task and context

The junior data analyst in this example works for a social media company. A new
business model was implemented on February 15, 2020 and the company wants to
understand how their user-growth compares to the previous year. Specifically, the data
analyst was asked to find out how many users have joined since February 15, 2020.

88
Spreadsheets functions and formulas or SQL queries?

Before they can address this question, this data analyst needs to choose what tool to
use. First, they have to think about where the data lives. If it is stored in a database,
then SQL is the best tool for the job. But if it is stored in a spreadsheet, then they will
have to perform their analysis in that spreadsheet. In that scenario, they could create a
pivot table of the data and then apply specific formulas and filters to their data until they
were given the number of users that joined after February 15th. It isn’t a really
complicated process, but it would involve a lot of steps.

In this case, the data is stored in a database, so they will have to work with SQL. And
this data analyst knows they could get the same results with a single SQL query:

Spreadsheets and SQL both have their advantages and disadvantages:

Features of Spreadsheets Features of SQL Databases

Smaller data sets Larger datasets

Enter data manually Access tables across a database

Create graphs and visualizations in the same Prepare data for further analysis in another softwa
program

Built-in spell check and other useful functions Fast and powerful functionality

89
Features of Spreadsheets Features of SQL Databases

Best when working solo on a project Great for collaborative work and tracking queries
by all users

Key takeaways

When it comes down to it, where the data lives will decide which tool you use. If you are
working with data that is already in a spreadsheet, that is most likely where you will
perform your analysis. And if you are working with data stored in a database, SQL will
be the best tool for you to use for your analysis. You will learn more about SQL coming
up, so that you will be ready to tackle any business problem with the best tool possible.

Spreadsheets versus SQL


en

Interactive Transcript - Enable basic transcript


mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

Hey there. So far we've learned about both spreadsheets and SQL. While there's lots
of differences between spreadsheets and SQL, you'll find some similarities too. Let's check out
what spreadsheets and SQL have in common and how they're different. Spreadsheets and SQL
actually have a lot in common. Specifically, there's tools you can use in both spreadsheets and
SQL to achieve similar results. We've already learned about some tools for cleaning data
in spreadsheets, which means you already know some tools that you can use in SQL.

For example, you can still perform arithmetic, use formulas and join data when you're using
SQL, so we'll build on the skills we've learned in spreadsheets and use them to do even more
complex work in SQL. Here's an example of what I mean by more complex work. If we were

90
working with health data for a hospital, we'd need to be able to access and process a lot of
data. We might need demographic data, like patients' names, birthdays, and
addresses, information about their insurance or past visits, public health data or even
user generated data to add to their patient records. All of this data is being stored in different
places, maybe even in different formats, and each location might have millions of rows and
hundreds of related tables. This is way too much data to input manually, even for just one
hospital. That's where SQL comes in handy.

Instead of having to look at each individual data source and record it in our spreadsheet, we can
use SQL to pull all this information from different locations in our database. Now, let's say we
want to find something specific in all this data, like how many patients with a certain diagnosis
came in today. In a spreadsheet we can use the COUNTIF function to find that out, or we
can combine the COUNT and WHERE queries in SQL to find out how many rows match our
search criteria. This will give us similar results, but works with a much larger and more complex
set of data. Next, let's talk about how spreadsheets and SQL are different. First, it's important to
understand that spreadsheets and SQL are different things. Spreadsheets are generated with a
program like Excel or Google Sheets.

These programs are designed to execute certain built-in functions. SQL on the other hand is a
language that can be used to interact with database programs, like Oracle MySQL or Microsoft
SQL Server. The differences between the two are mostly in how they're used. If a data analyst
was given data in the form of a spreadsheet they'll probably do their data cleaning and analysis
within that spreadsheet, but if they're working with a large data set with more than a million
rows or multiple files within a database, it's easier, faster and more repeatable to use SQL. SQL
can access and use a lot more data because it can pull information from different sources in the
database automatically, unlike spreadsheets which only have access to the data you input. This
also means that data is stored in multiple places. A data analyst might use spreadsheets stored
locally on their hard drive or their personal cloud when they're working alone, but if they're on a
larger team with multiple analysts who need to access and use data stored across a database,
SQL might be a more useful tool.

Because of these differences, spreadsheets and SQL are used for different things. As you
already know, spreadsheets are good for smaller data sets and when you're working
independently. Plus, spreadsheets have built-in functionalities, like spell check that can be really
handy. SQL is great for working with larger data sets, even trillions of rows of data. Because

91
SQL has been the standard language for communicating with databases for so long, it can be
adapted and used for multiple database programs. SQL also records changes in queries, which
makes it easy to track changes across your team if you're working collaboratively. Next, we'll
learn more queries and functions in SQL that will give you some new tools to work with.

SQL dialects and their uses


In this reading, you will learn more about SQL dialects and some of their different uses.
As a quick refresher, Structured Query Language, or SQL, is a language used to
talk to databases. Learning SQL can be a lot like learning a new language—including
the fact that languages usually have different dialects within them. Some database
products have their own variant of SQL, and these different varieties of SQL dialects are
what help you communicate with each database product.

These dialects will be different from company to company and might change over time if
the company moves to another database system. So, a lot of analysts start with
Standard SQL and then adjust the dialect they use based on what database they are
working with. Standard SQL works with a majority of databases and requires a small
number of syntax changes to adapt to other dialects.

As a junior data analyst, it is important to know that there are slight differences between
dialects. But by mastering Standard SQL, which is the dialect you will be working with in
this program, you will be prepared to use SQL in any database.

More information

You may not need to know every SQL dialect, but it is useful to know that these different
dialects exist. If you are interested in learning more about SQL dialects and when they
are used, you can check out these resources for more information:

 LearnSQL’s blog, What Is a SQL Dialect, and Which One Should You
Learn?
 Software Testing Help’s article, Differences Between SQL Vs MySQL vs SQL
Server
 Datacamp’s blog, SQL Server, PostgreSQL, MySQL... what's the
difference? Where do I start? Note that there is an error in this blog article. The

92
comparison table incorrectly states that SQlite uses subqueries instead of window
functions. Refer to the SQLite Window Functions documentation for proper
clarification.
 SQL Tutorial’s tutorial, What is SQL

Review: Set up your BigQuery


account
Note: This reading is also in Courses 3 and 5 of this program. If you’re taking the
courses in order, you may either review it or move on to the next new course item,
Hands-On Activity: Processing time with SQL. If you haven’t taken Courses 3 or 5 you
should complete this reading before proceeding to the next course item.

As you’ve been learning, BigQuery is a database you can use to access, explore, and
analyze data from many sources. Now, you’ll begin using BigQuery, which will help you
gain SQL knowledge by typing out commands and troubleshooting errors. This reading
will guide you through the process of setting up your very own BigQuery account.

Note: Working with BigQuery is not a requirement of this program. Additional


resources for other SQL database platforms are also provided at the end of this reading
if you choose to use them instead.

BigQuery account options

BigQuery offers a variety of account tiers to cater to various user needs and has two
free-of-charge entry points, a sandbox account and a free-of-charge trial account.
These options allow you to explore the program before selecting the best choice to suit
your needs. A sandbox account allows you to practice writing queries and to explore
public datasets free of charge, but it has quotas and limits, as well as some additional
restrictions. If you prefer to use BigQuery with the standard limits, you can set up a free-
of-charge trial account instead. The free-of-charge trial is a trial period prior to paying for
a subscription. In this instance, there is no automatic charge, but you will be asked for
payment information when you create the account.

93
This reading provides instructions for setting up either account type. An effective first
step is to begin with a sandbox account and switch to a free-of-charge trial account
when needed to run the SQL presented upcoming courses.

Sandbox account

The sandbox account is available at no cost, and anyone with a Google account can
use it. However, it does have some limitations. For instance, you are limited to a
maximum of 12 projects at a time. This means that, to create a 13th project, you'll need
to delete one of your existing 12 projects. Additionally, the sandbox account doesn't
support all operations you’ll do in this program. For example, there are limits on the
amount of data you can process and you can’t insert new records into a database or
update the values of existing records. However, a sandbox account is perfect for most
program activities, including all of the activities in this course. Additionally, you can
convert your sandbox account into a free-of-charge trial account at any time.

Set up your sandbox account

To set up a sandbox account:

1. Visit the BigQuery sandbox documentation page.


2. Log in to your preferred Google account by selecting the profile icon in the BigQuery
menu bar.
3. Select the Go to BigQuery button on the documentation page.
4. You'll be prompted to select your country and read the terms of service agreement.
5. This will bring you to the SQL Workspace, where you'll be conducting upcoming
activities. By default, BigQuery creates a project for you.

After you set up your account, the name of the project will be in the banner in your
BigQuery console.

Free-of-charge trial

If you wish to explore more of BigQuery's capabilities with fewer limitations, consider the
Google Cloud Free Trial. It provides you with $300 in credit for Google Cloud usage
during the first 90 days. If you're primarily using BigQuery for SQL queries, you're
unlikely to come close to this spending limit. After you've used up the $300 credit or
after 90 days, your free trial will expire, and you will only be able to use this account if
94
you pay to do so. Google won't automatically charge your payment method when the
trial ends. However, you'll need to set up a payment option with Google Cloud. This
means that you’ll need to enter your financial information. Rest assured, it won't charge
you unless you consciously opt to upgrade to a paid account. If you're uncomfortable
providing payment information, don't worry; you can use the BigQuery sandbox account
instead.

Set up your free-of-charge trial

1. Go to the BigQuery page.


2. Select Try BigQuery free.
3. Log in using your Google email, or create an account free of charge if you don't have
one. Click here to create an account.
4. Select your country, a description of your organization or needs, and the checkbox to
accept the terms of service, Then select CONTINUE.
5. Enter your billing information and select START MY FREE TRIAL.

After you set up your account, your first project, titled My First Project will be in the
banner.

Transferring between BigQuery accounts

With either a sandbox or free-of-charge trial account, you have the flexibility to upgrade
to a paid account at any time. If you upgrade, all your existing projects will be retained
and transferred to your new account. If you started with a free-of-charge trial, but
choose not to upgrade when it ends, you can switch to a sandbox account. However,
note that projects from your trial won't transfer to your sandbox. Essentially, creating a
sandbox is like starting from scratch.

Get started with other databases (if not using BigQuery)

It’s easiest to follow along with the course activities if you use BigQuery, but you may
use other SQL platforms, if you prefer. If you decide to practice SQL queries on other
database platforms, here are some resources to get started:

 Getting Started with MySQL

95
 Getting Started with Microsoft SQL Server
 Getting Started with PostgreSQL
 Getting Started with SQLite

Key takeaways

BigQuery offers multiple account options. Keep the following in mind when you choose
an account type:

 Account tiers: BigQuery provides various account tiers to cater to a wide range of
user requirements. Whether you're starting with a sandbox account or exploring a paid
account with the free-of-charge trial option, BigQuery offers flexibility to choose the
option that aligns best with your needs and budget.
 Sandbox limitations: While a sandbox account is a great starting point, it comes
with some limitations, such as a cap on the number of projects and restrictions on data
manipulation operations like inserting or updating records, which you will encounter later
in this program. Be aware of these limitations if you choose to work through this course
using a sandbox account.
 Easy setup and upgrades: Getting started with any BigQuery account type is
quick and easy. And if your needs evolve, you have the flexibility to modify your account
status at any time. Additionally, projects can be retained even when transitioning
between account types.

Choose the right BigQuery account type to match your specific needs and adapt as your
requirements change!

Review: Get started with


BigQuery
Note: This reading is also in Courses 3 and 5 of this program. If you’re taking the
courses in order, you may either review it or move on to the next course item, Hands-
On Activity: Processing time with SQL. If you haven’t taken Courses 3 or 5 you should
complete this reading before proceeding to the next course item.

96
BigQuery is a data warehouse on the Google Cloud Platform used to query and filter
large datasets, aggregate results, and perform complex operations. Throughout this
program, you’re going to use BigQuery to practice your SQL skills and collect, prepare,
and analyze data. At this point, you have set up your own account. Now, explore some
of the important elements of the SQL workspace. This will prepare you for the upcoming
activities in which you will use BigQuery. Note that BigQuery updates its interface
frequently, so your console might be slightly different from what is described in this
reading. That’s okay; use your troubleshooting skills to find what you need!

Log in to BigQuery

When you log in to BigQuery using the landing page, you will automatically open your
project space. This is a high-level overview of your project, including the project
information and the current resources being used. From here, you can check your
recent activity.

Navigate to your project’s BigQuery Studio by selecting BigQuery from the navigation
menu and BigQuery Studio from the dropdown menu.

BiqQuery Studio components

Once you have navigated to BigQuery from the project space, most of the major
components of the BigQuery console will be present: the Navigation pane, the
Explorer pane, and the SQL Workspace.

The Navigation pane

On the console page, find the Navigation pane. This is how you navigate from the
project space to the BigQuery tool. This menu also contains a list of other Google Cloud
Project (GCP) data tools. During this program, you will focus on BigQuery, but it’s useful
to understand that the GCP has a collection of connected tools data professionals use
every day.

97
The Explorer pane

The Explorer pane lists your current projects and any starred projects you have
added to your console. It’s also where you’ll find the + ADD button, which you can use
to add datasets.

This button opens the Add dialog that allows you to open or import a variety of
datasets.

Add Public Datasets

BigQuery offers a variety of public datasets from the Google Cloud Public Dataset
Program. Scroll down the Add dialog to the Public Datasets option.

Select Public Datasets. This takes you to the Public Datasets Marketplace,
where you can search for and select public datasets to add to your BigQuery console.
For example, search for the "noaa lightning" dataset in the Marketplace search bar.
When you search for this dataset, you will find NOAA’s Cloud-to-Ground Lightning
Strikes data.

Select the dataset to read its description. Select View dataset to create a tab of the
dataset’s information within the SQL workspace.

The Explorer Pane lists the noaa_lightning and other public datasets.

Star and examine Public Datasets

You added the public noaa_lightning dataset to your BigQuery Workspace, so the
Explorer pane displays the noaa_lightning dataset, along with the list of other
public datasets. These datasets are nested under bigquery-public-data. Star
bigquery-public-data by navigating to the top of the Explorer pane and selecting
the star next to bigquery-public-data.

98
Starring bigquery-public-data will enable you to search for and add public datasets
by scrolling in the Explorer pane or by searching for them in the Explorer search
bar.

For example, you might want to select a different public dataset. If you select the
second dataset, austin_311, it will expand to list the table stored in it,
311_service_requests.

The Explorer pane with the “bigquery-public data” and “austin_311” datasets expanded,
revealing the “311_service_requests” table

When you select a table, its information is displayed in the SQL Workspace. Select the
311_service_requests table to examine several tabs that describe it, including:

 Schema, which displays the column names in the dataset


 Details, which contains additional metadata, such as the creation date of the dataset
 Preview, which shows the first rows from the dataset

Additionally, you can select the Query button from the menu bar in the SQL
Workspace to query this table.

The SQL Workspace

The final menu pane in your console is the SQL Workspace. This is where you will
actually write and execute queries in BigQuery.

The SQL Workspace also gives you access to your personal and project history, which
stores a record of the queries you’ve run. This can be useful if you want to return to a
query to run it again or use part of it in another query.

Upload your data

In addition to offering access to public datasets, BigQuery also gives you the ability to
upload your own data directly into your workspace. Access this feature by opening the
+ ADD menu again or by clicking the three vertical dots next to your project’s name in
the Explorer pane. This will give you the option to create your own dataset and upload

99
your own tables. You will have the opportunity to upload your own data in an upcoming
activity to practice using this feature!

Key takeaways

BigQuery's SQL workspace allows you to search for public datasets, run SQL queries,
and even upload your own data for analysis. Whether you're working with public
datasets, running SQL queries, or uploading your own data, BigQuery’s SQL workspace
offers a range of features to support all kinds of data analysis tasks. Throughout this
program, you will be using BigQuery to practice your SQL skills, so being familiar with
the major components of your BigQuery console will help you navigate it effectively in
the future!

Optional: Upload the customer


dataset to BigQuery
In the next video, the instructor uses a specific dataset. The instructions in this reading
are provided for you to upload the same dataset in your BigQuery console.

You must have a BigQuery account to follow along.

Prepare for the next video

 First, download the .csv file from the attachment below.

 Next, complete the following steps in your BigQuery console to upload the Customer
Table dataset.
Step 1: Open your BigQuery console and click on the project you want to upload the
data to.

Step 2: In the Explorer on the left, click the Actions icon (three vertical dots) next
to your project name and select Create dataset.

100
Step 3: In the upcoming video, the name "customer_data" will be used for the dataset.
If you plan to follow along with the video, enter customer_data for the Dataset ID.

101
Step 4: Click CREATE DATASET (blue button) to add the dataset to your project.

102
Step 5: In the Explorer on the left, click to expand your project, and then click the
customer_data dataset you just created.

Step 6: Click the Actions icon (three vertical dots) next to customer_data and
select Open.

Step 7: Click the blue + icon at the middle to open the Create table window.

Step 8: Under Source, for the Create table from selection, choose where the
data will be coming from.

 Select Upload.
 Click Browse to select the Customer Table .csv file you downloaded.
 Choose CSV from the file format drop-down.

Step 9: For Table name, enter customer_address if you plan to follow along
with the video.

Step 10: For Schema, click the Auto detect check box.

Step 11: Click Create table (blue button). You will now see the
customer_address table under your customer_data dataset in your project.

Step 12: Click customer_address and then select the Preview tab. Confirm that
you see the data shown below.

103
104
And now you have everything you need to follow along with the next video. This is also
a great table to use to practice querying data on your own. Plus, you can use these
steps to upload any other data you want to work with.

Widely used SQL queries


en

Interactive Transcript - Enable basic transcript


mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

Hey, welcome back. So far we've learned that SQL has some of the same tools as
spreadsheets, but on a much larger scale. In this video, we'll learn some of the most widely
used SQL queries that you can start using for your own data cleaning and eventual analysis.
Let's get started. We've talked about queries as requests you put into the database to ask it to
do things for you. Queries are a big part of using SQL. It's Structured Query Language, after all.

Queries can help you do a lot of things, but there are some common ones that data
analysts use all the time. So let's start there. First, I'll show you how to use the SELECT
query. I've called this one out before, but now I'll add some new things for us to try out. Right
now, the table viewer is blank because we haven't pulled anything from the database yet. For
this example, the store we're working with is hosting a giveaway for customers in certain
cities. We have a database containing customer information that we can use to narrow down
which customers are eligible for the giveaway. Let's do that now.

We can use SELECT to specify exactly what data we want to interact with in a table. If we
combine SELECT with FROM, we can pull data from any table in this database as long as
they know what the columns and rows are named. We might want to pull the data
about customer names and cities from one of the tables. To do that, we can input SELECT
name, comma, city FROM customer underscore data dot customer underscore address. To get
this information from the customer underscore address table, which lives in the customer

105
underscore data, data set. SELECT and FROM help specify what data we want to extract from
the database and use. We can also insert new data into a database or update existing data.

For example, maybe we have a new customer that we want to insert into this table. We can use
the INSERT INTO query to put that information in. Let's start with where we're trying to insert
this data, the customer underscore address table. We also want to specify which columns we're
adding this data to by typing their names in the parentheses. That way, SQL can tell the
database exactly where we were inputting new information. Then we'll tell it what values we're
putting in. Run the query, and just like that, it added it to our table for us.

Now, let's say we just need to change the address of a customer. Well, we can tell the database
to update it for us. To do that, we need to tell it we're trying to update the customer underscore
address table. Then we need to let it know what value we're trying to change. But we also need
to tell it where we're making that change specifically so that it doesn't change every address in
the table. There. Now this one customer's address has been updated.

If we want to create a new table for this database, we can use the CREATE TABLE IF NOT
EXISTS statement. Keep in mind, just running a SQL query doesn't actually create a table for
the data we extract. It just stores it in our local memory. To save it, we'll need to download it
as a spreadsheet or save the result into a new table. As a data analyst, there are a few
situations where you might need to do just that. It really depends on what kind of data you're
pulling and how often. If you're only using a total number of customers, you probably don't need
a CSV file or a new table in your database.

If you're using the total number of customers per day to do something like track a weekend
promotion in a store, you might download that data as a CSV file so you can visualize it in a
spreadsheet. But if you're being asked to pull this trend on a regular basis, you can create a
table that will automatically refresh with the query you've written. That way, you can directly
download the results whenever you need them for a report. Another good thing to keep in
mind, if you're creating lots of tables within a database, you'll want to use the DROP TABLE IF
EXISTS statement to clean up after yourself. It's good housekeeping. You probably won't be
deleting existing tables very often. After all, that's the company's data, and you don't want to
delete important data from their database.

But you can make sure you're cleaning up the tables you've personally made so that there aren't
old or unused tables with redundant information cluttering the database. There. Now you've

106
seen some of the most widely used SQL queries in action. There's definitely more query
keywords for you to learn and unique combinations that'll help you work within databases. But
this is a great place to start. Coming up, we'll learn even more about queries in SQL and how to
use them to clean our data. See you next time.

Clean string variables using SQL


en

Interactive Transcript - Enable basic transcript


mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

It's so great to have you back. Now that we know some basic SQL queries and spent some time
working in a database, let's apply that knowledge to something else we've been talking
about; preparing and cleaning data. You already know that cleaning and completing your data
before you analyze it is an important step. In this video, I'll show you some ways SQL can help
you do just that. Including how to remove duplicates as well as four functions to help you clean
string variables. Earlier we covered how to remove duplicates and spreadsheets using the
Remove duplicates tool. In SQL, we can do the same thing by including Distinct in our Select
statement.

For example, let's say the company we work for has a special promotion for customers in
Ohio. We want to get the customer IDs of customers who live in Ohio, but some customer
information has been entered multiple times. We can get these customer IDs by writing, select
customer_id from customer_data.customer_address. This query will give us duplicates if they
exist in the table. If customer ID 9080 shows up three times in our table, our results will have
three of that customer ID, but we don't want that. We want a list of all unique customer IDs. To
do that, we add Distinct to our Select statement by writing, Select Distinct customer_ id from
customer_data.customer_address.

107
Now the customer ID 9080 will show up only once in our results. You might remember we talked
before about text strings as a group of characters within a cell, commonly composed of letters,
numbers or both. These text strings need to be clean sometimes. Maybe they've been entered
differently in different places across your database, and now they don't match. In those cases,
you'll need to clean them before you can analyze them. Here are some functions you can use
in SQL to handle string variables. You might recognize some of these functions from when we
talked about spreadsheets.

Now it's time to see them work in a new way. Pull up the data set we shared right before this
video, and you can follow along step by step with me during the rest of this video. The first
function I want to show you is length, which we've encountered before. If we already know the
length our string variables are supposed to be, we can use length to double check that our
string variables are consistent. For some databases, this query is written as len, but it does the
same thing. Let's say we're working with the customer_address table from our earlier
example, we can make sure that all country codes have the same length by using Length on
each of these strings. To write our SQL query, let's first start with Select and From.

We know our data comes from the customer_address table within the customer_data
dataset. We add customer_ [Link] _address after the From clause. Then under
Select, we'll write Length, and then the column we want to check; country. To remind ourselves
what this is, we can label this column in our results as letter_in _country. We add as
letters_in _country after Length parentheses country. The result we get is a list of the number
of letters in each country listed for each of our customers. It seems that almost all of them are
twos, which means the country field contains only two letters, but we notice one that has three.

That's not good. We want our data to be consistent. Let's check out which countries were
incorrectly listed in our table. We can do that by putting the Length parenthesis, country
parentheses function that we created into the Where clause, because we're telling SQL to filter
the data to show only customers whose country contains more than two letters. Now we'll write,
Select country from customer_data.customer_address, where Length parentheses,
country parentheses greater than 2. When we run this query, we now get the two countries
where the number of letters is greater than the 2 we expect to find. The incorrectly listed
countries show up as USA instead of US.

108
If we created this table, then we could update our table so that this entries shows up as US
instead of USA. But in this case, we didn't create this table, so we shouldn't update it. We still
need to fix this problem, so we can pull a list of all the customers in the US, including the two
that have USA instead of US. The good news, is that, we can account for this error in our results
by using the sub-string function in our SQL query. To write our SQL query, let's start by writing
the basic structure. Select, From, Where. We know our data is coming from the
customer_address table from the customer_data dataset.

We type in customer_data.customer _ address. After from. Next, we tell SQL what data we want
it to give us. We want all the customers in the US by their IDs, so we type into customer_id after
select. Finally, we want SQL to filter out only American customers, so we use the substring
function after the where clause. We're going to use the substring function to pull the first two
letters of each country so that all of them are consistent and only contain two letters. To use the
substring function, we first need to tell SQL, the column where we found this error country.

Then we specify which letter to start with. We want SQL to pull the first two letters, so we're
starting with the first letter, so we type in one. Then we need to tell SQL how many
letters, including this first letter to pull. Since we want the first two letters, we need SQL to
pull two total letters, so we type in two. This will give us the first two letters of each country. We
want US only, so we'll set this function to equals US. When we run this query, we get a list of all
customer IDs of customers whose country is the US, including the customers that had USA
instead of US.

Going through our results, it seems like we have a couple duplicates where the customer ID is
shown multiple times. Remember how we get rid of duplicates. We add distinct before customer
underscore ID. Now when we run this query, we have our final list of customer IDs of the
customers who live in the US. Finally, let's check out the trim function which you've come across
before. This is really useful if you find entries with extra spaces and need to eliminate those
extra spaces for consistency. For example, let's check out the state column in our
customer_address table.

Just like we did for the country column, we want to make sure the state column has the
consistent number of letters. Let's use the length function again to learn if we have any state
that has more than two letters, which is what we would expect to find in our data table. We start
writing our SQL query by typing the basic SQL structure of , select from where. We're working

109
with the customer_address table in the customer_data dataset, so we type in
customer_data, dot customer_address after from. Next, we tell SQL what we want it to pull. We
want it to give us any state that has more than two letters, so we type in state after
select. Finally, we want SQL to filter for states that have more than two letters.

This condition is written in the where clause. We type in length, parentheses state,
parentheses, and that it must be greater than two because we want the states that have more
than two letters. We want to figure out what the incorrectly listed states look like, if we have
any. When we run this query, we get one result. We have one state that has more than two
letters. But hold on. How can this state that seems like it has two letters, O and H for Ohio have
more than two letters?

We know that there are more than two characters because we use the length parentheses
state, parentheses greater than two statement in the where clause when filtering our
results. That means the extra characters that SQL is counting must then be a space. There
must be a space after the age. This is where we would use the trim function. The trim function
removes any spaces. Let's write a SQL query that accounts for this error. Let's say we want a
list of all customer IDs of the customers who live in, OH, for Ohio.

We start with the basic SQL structure. Select from where. We know the data comes from, the
customer_address table, and the customer_data dataset. We type in
customer_data.customer_address after from. Next, we tell SQL what data we want. We want
SQL to give us the customer IDs of customers who live in Ohio. We type in customer_id after
select.

Since we know we have some duplicate customer entries, we'll go ahead and type in distinct
before customer ID to remove any duplicate customer IDs from appearing in our results. Finally,
we want SQL to give us the customer IDs of the customers who live in Ohio. We're asking SQL
to filter the data. This belongs in the where clause. Here's where we'll use the trim function. To
use the trim function, we tell SQL the column we want to remove spaces from, which is state in
our case, and we want only Ohio customers, so we type in equals OH. That's it. We have all
customer IDs of the customers who live in Ohio, including that customer with the extra space
after the H. Making sure that your string variables are complete and consistent will save you a
lot of time later by avoiding errors or miscalculations.

110
That's why we clean data in the first place. Hopefully, functions like length, substring, and trim
will give you the tools you need to start working with string variables in your own datasets. Next
up, we'll check out some other ways you can work with strings and more advanced cleaning
functions. Then you'll be ready to start working in SQL on your own. See you soon.

Optional: Upload the store


transactions dataset to
BigQuery
In the next video, the instructor uses a specific dataset. The instructions in this reading
are provided for you to upload the same dataset in your BigQuery console so you can
follow along.

You must have a BigQuery account to follow along.

Prepare for the next video

 First, download the .csv file from the attachment below.

 Next, complete the steps below in your BigQuery console to upload the Store
Transaction dataset.
Note: These steps will be different from what you performed before. In previous
instances, you selected the Auto detect check box to allow BigQuery to auto-detect
the schema. This time, you will choose to create the schema by editing it as text. This
method can be used when BigQuery doesn't automatically set the desired type for a
particular field. In this case, you will specify STRING instead of FLOAT as the type for the
purchase_price field.

Step 1: Open your BigQuery console and select the expansion arrow next to the
project you'll upload the data to. Examine the datasets listed under your project. If a
customer_data dataset is listed, go to step 5; otherwise, continue with step 2.

111
Step 2: In the Explorer on the left, click the Actions icon (three vertical dots) next
to your project name and select Create dataset.

Step 3: In the Create dataset window, enter customer_data for the Dataset ID.
Make sure the Location type is set to Multi-region, US (Multiple regions in
United States), and all the default Advanced options remain set to the Google-
managed encryption key option.

112
Step 4: Click CREATE DATASET (blue button) to add the dataset to your project.

Step 5: In the Explorer pane, click on the expansion arrow under your project name,
and then click the customer_data dataset.

113
Step 6: In the Dataset info page, click the blue + CREATE TABLE button to
open the Create table window.

Step 7: Under Source, for the Create table from selection, choose where the
data will be coming from.

 Select Upload.
 Click Browse to select the Store Transaction Table .csv file you downloaded.
 Choose CSV from the file format drop-down.

114
Step 8: For Table name, enter customer_purchase.

Step 9: For Schema, click the toggle switch for Edit as text. This opens up a box
for the text.

115
Step 10: Copy and paste the following text into the box. Be sure to include the
opening and closing brackets. They are required.

[ { "description": "date", "mode": "NULLABLE", "name": "date", "type": "DATETIME" },


{ "description": "transaction id", "mode": "NULLABLE", "name": "transaction_id", "type":
"INTEGER" }, { "description": "customer id", "mode": "NULLABLE", "name":
"customer_id", "type": "INTEGER" }, { "description": "product name", "mode":
"NULLABLE", "name": "product", "type": "STRING" }, { "description": "product_code",
"mode": "NULLABLE", "name": "product_code", "type": "STRING" }, { "description":
"product color", "mode": "NULLABLE", "name": "product_color", "type": "STRING" },
{ "description": "product price", "mode": "NULLABLE", "name": "product_price", "type":
"FLOAT" }, { "description": "quantity purchased", "mode": "NULLABLE", "name":
"purchase_size", "type": "INTEGER" }, { "description": "purchase price", "mode":
"NULLABLE", "name": "purchase_price", "type": "STRING" }, { "description": "revenue",
"mode": "NULLABLE", "name": "revenue", "type": "FLOAT" } ]

Step 11: Scroll down and expand the Advanced options section.

Step 12: For the Header rows to skip field, enter 1.

NOTE: It is very important that you don't skip the last step, or you will receive 'parsing'
errors, as BigQuery will try to apply the schema editing functions to the title row.

116
Step 13: Click Create table (blue button). You will now see the
customer_purchase table under your customer_data dataset in your
Explorer pane.

Step 14: Click the customer_purchase table and in the Schema tab, confirm
that the schema matches the schema shown below.

117
118
Step 15: Click the Preview tab and confirm that your data matches the data shown
below.

119
120
Congratulations, you are now ready to follow along with the video!

Advanced data-cleaning functions, part 1


en

Interactive Transcript - Enable basic transcript


mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

Hi there and welcome back. So far we've gone over some basic SQL queries and functions that
can help you clean your data. We've also checked out some ways you can deal with string
variables in SQL to make your job easier. Get ready to learn more functions for dealing with
strings in SQL. Trust me, these functions will be really helpful in your work as a data analyst. In
this video, we'll check out strings again and learn how to use the cast function to correctly
format data. When you import data that doesn't already exist in your SQL tables, the data types
from the new dataset might not have been imported correctly.

This is where the CAST function comes in handy. Basically, CAST can be used to convert
anything from one data type to another. Let's check out an example. Imagine we're working with
Lauren's Furniture Store. The owner has been collecting transaction data for the past year,
but she just discovered that they can't actually organize their data because it hadn't been
formatted correctly. So we'll help her by converting her data to make it useful again. For
example, let's say we want to sort all purchases by purchase_price in descending order.

That means we want the most expensive purchase to show up first in our results. To write the
SQL query, we start with the basic SQL structure. SELECT, FROM, WHERE, we know the data
is stored in the customer_purchase table in the customer_dataset. So we write
customer_data.customer_purchase after FROM. Next, we tell SQL what data to give us in the
select clause. We want to see the purchase_price data, so we type purchase_price after
SELECT. Next is the where clause.

121
We are not filtering out any data since we want all purchase prices shown, so we can take out
the where clause. Finally, to sort the purchase_price in descending order, we type ORDER
BY purchase_price DESC at the end of our query. Let's run this query. We see that 89.85
shows up at the top with 799.99 below it, but we know that 799.99 is a bigger number than
89.85. The database doesn't recognize that these are numbers, so it didn't sort them that way. If
we go back to the customer_purchase table and take a look at its schema, we can see what
data type the database thinks purchase_price is. It says here the database thinks
purchase_price is a string, when in fact it is a float, which is a number that contains a decimal.

That is why 89.85 shows up before 799.99. When we sort letters, we start from the first letter
before moving on to the second letter. So if we want to sort the words apple and orange in
descending order, we start with the first letters a and o. Since o comes after a, orange will show
up first, then apple. The database did the same with 89.85 and 799.99. It started with the first
letter, which in this case was 8 and 7 respectively. Since 8 is bigger than 7, the database sorted
89.85 first and then 799.99 because the database treated these as text strings.

The database doesn't recognize these strings as floats because they haven't been typecast to
match that data type yet. Typecasting means converting data from one type to another, which is
what we'll do with the CAST function. We use the CAST function to replace purchase_price with
a new purchase_price that the database recognizes as float instead of string. We start by
replacing purchase_price with CAST. Then we tell SQL the field we want to change, which is
the purchase_price field. Next is a data type we want to change purchase_price to, which is the
FLOAT data type. BigQuery stores numbers in a 64 bit system, so the FLOAT data type is
referenced as float 64 in our query.

This might be slightly different in other SQL platforms, but basically the 64 and float 64 just
indicates that we're casting numbers in the 64 bit system as FLOATs. We also need to sort this
new field so we change purchase_price after ORDER BY to CAST purchase_price as
FLOAT64. This is how we use the cast function to allow SQL to recognize the purchase_price
column as FLOATs instead of text strings. Now we can sort our purchases by
purchase_price. And just like that, Lauren's Furniture Store has data that can actually be used
for analysis. As a data analyst, you'll be asked to locate and organize data a lot, which is why
you want to make sure you convert between data types early on. Businesses like our Furniture
Store are interested in timely sales data, and you need to be able to account for that in your
analysis.

122
The CAST function can be used to change strings into other data types too, like date and
time. As a data analyst, you might find yourself using data from various sources. Part of your job
is making sure the data from those sources is recognizable and usable in your database so that
you won't run into any issues with your analysis. And now you know how to do that. The CAST
function is one great tool you can use when you're cleaning data. And coming up, we'll cover
some other advanced functions that you can add to your toolbox. See you soon.

Advanced data-cleaning functions, part 2


en

Interactive Transcript - Enable basic transcript


mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

Hey there. Great to see you again. So far we've seen some SQL functions in action. In this
video, we'll go over more uses for CAST and then learn about CONCAT and COALESCE. Let's
get started. Earlier we talked about the CAST function, which lets us type cast text strings into
floats. I called out that the CAST function can be used to change into other data types too.

Let's check out another example of how you can use CAST in your own data work. We've got
the transaction data we were working with from our Lauren's furniture store example, but now
we'll check out the purchase date field. The furniture store owner has asked us to look
at purchases that occurred during their sales promotion period in December. Let's write a SQL
query that will pull date and purchase_price for all purchases that occurred between December
1st, 2020 and December 31st, 2020. We start by writing the basic SQL structure; SELECT,
FROM, WHERE. We know the data comes from the customer_purchase table in the
customer_data data set, so we write customer_data.customer_purchase after FROM. Next we
tell SQL what data to pull.

Since we want date and purchase_price, we add them into the SELECT statement. Finally, we
want SQL to filter for purchases that occurred in December only, so we type date

123
BETWEEN 2020-12-01 and 2020-12-31 in the WHERE clause. Let's run the query. Four
purchases occurred in December, but the date field looks odd. That's because the database
recognizes the date field as date time, which consists of the date and time. Our SQL query still
works correctly even if the date field is date time instead of date. But we can tell SQL to convert
the date field into the date data type so we see just the date and not the time.

To do that, we use the CAST function again. We'll use the CAST function to replace the date
field in our select statement with the new date field that will show the date and not the time. We
can do that by typing CAST and adding the date as the field we want to change, then we tell
SQL the data type we want instead, which is the date data type. There. Now we can
have cleaner results for purchases that occurred during the December sales period. CAST is a
super useful function for cleaning and sorting data, which is why I wanted you to see it in action
one more time. Next up, let's check out the CONCAT function.

CONCAT lets you add strings together to create new text strings that can be used as unique
keys. Going back to our customer_purchase table, we see that the furniture store sells different
colors of the same product. The owner wants to know if customers prefer certain colors so the
owner can manage store inventory accordingly. The problem is the product_code is the same
regardless of the product color. We need to find another way to separate products by color so
we can tell if customers prefer one color over the others. We'll use CONCAT to produce a
unique key that'll help us tell the products apart by color and count them more easily. Let's write
our SQL query by starting with the basic structure.

SELECT, FROM, WHERE. We know our data comes from the customer_purchase table
and the customer_data data set, so we type customer_data.customer_purchase after
FROM. Next we tell SQL what data to pull. We use the CONCAT function here to get that
unique key of product and color. We type CONCAT, the first column we want, product_code,
and the other column we want, product_color. Finally, let's say we want to look at couches, so
we filter for couches by typing product='couch' in the WHERE clause. Now we can count how
many times each couch was purchased and figure out if customers preferred one color over the
others.

With CONCAT, the furniture store can find out which color couches are the most popular and
order more. I've got one last advanced function to show you, COALESCE. COALESCE can be
used to return non-null values in a list. Null values are missing values. If you have a field that's

124
optional in your table, it'll have null in that field for rows that don't have appropriate values to put
there. Let's open the customer_purchase table so I can show you what I mean. In the
customer_purchase table, we can see a couple of rows where product information is missing.

That is why we see nulls there. But for the rows where product name is null, we see that there is
product_code data that we can use instead. We'd prefer SQL to show us the product name, like
bed or couch because it's easier for us to read. But if the product name doesn't exist, we can tell
SQL to give us the product_code instead. That is where the COALESCE function comes into
play. Let's say we wanted a list of all products that were sold. We want to use the product name
column to understand what product was sold.

So we write our SQL query with the basic SQL structure, SELECT, FROM, WHERE. We know
our data comes from customer_purchase table and the customer_data data set, so we type
customer_data.customer_purchase after FROM. Next, we tell SQL the data we want. We want
a list of product names. But if names aren't available, then give us the product code. Here is
where we type COALESCE, then we tell which column to check first, product, and which column
to check second, if the first column is null, product_code. We'll name this new field as
product_info.

Finally, we are not filtering out any data so we can take out the WHERE clause. This gives us
product information for each purchase. Now we have a list of all products that were sold for the
owner to review. COALESCE can save you time when you're making calculations too, by
skipping any null values and keeping your math correct. Those were just some of the advanced
functions you can use to clean your data and get it ready for the next step in the analysis
process. You'll discover more as you continue working in SQL. But that's the end of this video
and this module.

Great work. We've covered a lot of ground. You learned the different data cleaning
functions and spreadsheets and SQL, and the benefits of using SQL to deal with large data
sets. We also added some SQL formulas and functions to your toolkit, and most importantly, we
got to experience some of the ways that SQL can help you get data ready for your
analysis. After this, you'll get to spend some time learning how to verify and report your cleaning
results so that your data is squeaky clean and your stakeholders know it. But before that, you've
got another weekly challenge to tackle. You've got this.

125
Some of these concepts might seem challenging at first, but they'll become second nature to
you as you progress in your career. It just takes time and practice. Speaking of practice, feel
free to go back to any of these videos and re-watch or even try some of these commands on
your own. Good luck and I'll see you again when you're ready.

Glossary terms from module 3


Terms and definitions for Course 4, Module 3

CAST: A SQL function that converts data from one datatype to another

COALESCE: A SQL function that returns non-null values in a list

CONCAT: A SQL function that adds strings together to create new text strings that can
be used as unique keys

DISTINCT: A keyword that is added to a SQL SELECT statement to retrieve only non-
duplicate entries

Float: A number that contains a decimal

Substring: A subset of a text string

Typecasting: Converting data from one type to another

Module 04

Verify and report results


en

Interactive Transcript - Enable basic transcript


mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

126
Hi there, great to have you back. You've been learning a lot about the importance of clean data
and explored some tools and strategies to help you throughout the cleaning process. In these
videos, we'll be covering the next step in the process: verifying and reporting on the integrity of
your clean data. Verification is a process to confirm that a data cleaning effort was
well- executed and the resulting data is accurate and reliable. It involves rechecking your clean
dataset, doing some manual clean ups if needed, and taking a moment to sit back and
really think about the original purpose of the project. That way, you can be confident that the
data you collected is credible and appropriate for your purposes. Making sure your data is
properly verified is so important because it allows you to double-check that the work you did
to clean up your data was thorough and accurate.

For example, you might have referenced an incorrect cellphone number or accidentally keyed in
a typo. Verification lets you catch mistakes before you begin analysis. Without it, any insights
you gain from analysis can't be trusted for decision-making. You might even risk
misrepresenting populations or damaging the outcome of a product that you're actually trying to
improve. I remember working on a project where I thought the data I had was sparkling clean
because I'd use all the right tools and processes, but when I went through the steps to verify the
data's integrity, I discovered a semicolon that I had forgotten to remove. Sounds like a really tiny
error, I know, but if I hadn't caught the semicolon during verification and removed it, it would
have led to some big changes in my results. That, of course, could have led to different
business decisions.

There's an example of why verification is so crucial. But that's not all. The other big part of the
verification process is reporting on your efforts. Open communication is a lifeline for any data
analytics project. Reports are a super effective way to show your team that you're being 100
percent transparent about your data cleaning. Reporting is also a great opportunity to show
stakeholders that you're accountable, build trust with your team, and make sure you're all on the
same page of important project details. Coming up, you'll learn different strategies for
reporting, like creating data- cleaning reports, documenting your cleaning process, and using
something called the changelog.

A changelog is a file containing a chronologically ordered list of modifications made to a


project. It's usually organized by version and includes the date followed by a list of
added, improved, and removed features. Changelogs are very useful for keeping track of how a
dataset evolved over the course of a project. They're also another great way to communicate

127
and report on data to others. Along the way, you'll also see some examples of how verification
and reporting can help you avoid repeating mistakes and save you and your team time. Ready
to get started? Let's go!

Confirm data-cleaning meets business


expectations
en

Interactive Transcript - Enable basic transcript


mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

In this video, we'll discuss how to begin the process of verifying your data-cleaning
efforts. Verification is a critical part of any analysis project. Without it you have no way of
knowing that your insights can be relied on for data-driven decision-making. Think of verification
as a stamp of approval. To refresh your memory, verification is a process to confirm that a data-
cleaning effort was well-executed and the resulting data is accurate and reliable. It also involves
manually cleaning data to compare your expectations with what's actually present. The first step
in the verification process is going back to your original unclean data set and comparing it to
what you have now.

Review the dirty data and try to identify any common problems. For example, maybe you had a
lot of nulls. In that case, you check your clean data to ensure no nulls are present. To do that,
you could search through the data manually or use tools like conditional formatting or filters. Or
maybe there was a common misspelling like someone keying in the name of a product
incorrectly over and over again. In that case, you'd run a FIND in your clean data to make sure
no instances of the misspelled word occur. Another key part of verification involves taking a big-
picture view of your project.

This is an opportunity to confirm you're actually focusing on the business problem that you need
to solve and the overall project goals and to make sure that your data is actually capable of

128
solving that problem and achieving those goals. It's important to take the time to reset and focus
on the big picture because projects can sometimes evolve or transform over time without us
even realizing it. Maybe an e-commerce company decides to survey 1000 customers to get
information that would be used to improve a product. But as responses begin coming in, the
analysts notice a lot of comments about how unhappy customers are with the e-commerce
website platform altogether. So the analysts start to focus on that. While the customer buying
experience is of course important for any e-commerce business, it wasn't the original objective
of the project. The analysts in this case need to take a moment to pause, refocus, and get back
to solving the original problem.

Taking a big picture view of your project involves doing three things. First, consider the business
problem you're trying to solve with the data. If you've lost sight of the problem, you have no way
of knowing what data belongs in your analysis. Taking a problem-first approach to analytics is
essential at all stages of any project. You need to be certain that your data will actually make it
possible to solve your business problem. Second, you need to consider the goal of the
project. It's not enough just to know that your company wants to analyze customer
feedback about a product.

What you really need to know is that the goal of getting this feedback is to make improvements
to that product. On top of that, you also need to know whether the data you've collected
and cleaned will actually help your company achieve that goal. And third, you need to consider
whether your data is capable of solving the problem and meeting the project objectives. That
means thinking about where the data came from and testing your data collection and cleaning
processes. Sometimes data analysts can be too familiar with their own data, which makes it
easier to miss something or make assumptions. Asking a teammate to review your data from a
fresh perspective and getting feedback from others is very valuable in this stage. This is also the
time to notice if anything sticks out to you as suspicious or potentially problematic in your data.

Again, step back, take a big picture view, and ask yourself, do the numbers make sense? Let's
go back to our e-commerce company example. Imagine an analyst is reviewing the cleaned up
data from the customer satisfaction survey. The survey was originally sent to 1,000 customers,
but what if the analyst discovers that there is more than a thousand responses in the data? This
could mean that one customer figured out a way to take the survey more than once. Or it could
also mean that something went wrong in the data cleaning process, and a field was

129
duplicated. Either way, this is a signal that it's time to go back to the data-cleaning process and
correct the problem.

Verifying your data ensures that the insights you gain from analysis can be trusted. It's an
essential part of data-cleaning that helps companies avoid big mistakes. This is another place
where data analysts can save the day. Coming up, we'll go through the next steps in the data-
cleaning process. See you there.

Step-by-Step: Verification of
data cleaning
This reading outlines the steps the instructor performs in the following video, Verification
of data cleaning. The video demonstrates how to verify cleaned data in both
spreadsheets and SQL.

Keep this step-by-step guide open as you watch the video. It can serve as a helpful
reference if you need additional context or clarification while following the video steps.
This is not a graded activity, but you can complete these steps to practice the skills
demonstrated in the video.

What you’ll need

If you’d like to follow along with the examples in this video, choose a spreadsheet tool.
Google Sheets or Excel are recommended.

To access the spreadsheet the instructor uses in this video, click the link to the template
to create a copy of the dataset. If you don’t have a Google account, download the data
directly from the attachments below.

Link to dataset: Jeff’s Party Planet - Data for Cleaning

OR

130
Note: The SQL table used in this example is not available for this activity.

Example 1: Verify data with spreadsheets

Use spreadsheet tools such as Find and Replace and pivot tables to find, understand,
and fix errors in your spreadsheet.

Use Find and Replace to replace all instances of a mistake

1. Use the Jeff’s Party Planet - Data for Cleaning dataset.


2. From the Edit menu, choose Find and Replace to open the Find and replace
dialog box.
3. In the Find field, enter the misspelled word in the supplier name, Plos.
4. In the Replace with field, enter Plus.
5. Click Replace all to replace all instances of "Plos" with "Plus". Click Done to close
the Find and replace dialog box.
6. Select the Undo button to use a different method to correct this misspelling. This can
also be done with Ctrl (Windows) or Command (Mac) Z.

Use a pivot table to understand errors in a spreadsheet

1. Select the Suppliers column.


2. Select Insert > Pivot Table. In the Create pivot table dialog box, choose
New Sheet then Create.
3. This creates a new tab that is mostly blank.
4. Additionally, the Pivot table editor pane is in the window.
5. Next to Rows. Select Add, then the Suppliers column.
6. Next to Values, select Add then select Suppliers. This adds a value for the
Suppliers column.
7. By default, Google Sheets sets the value to summarize by COUNTA (the total number of
values in a range). This will show how many times each supplier name comes up. It’s a
great way to check for misspellings and other anomalies. Note: Don’t use COUNT,
because COUNT counts only numerical values.
8. When there is only one instance of the misspelled name, manually change it to the
correct spelling.
9. To return to the original sheet, select the Sheet1 tab.

131
Example 2: Use a CASE statement to verify data in SQL
Use CASE statements to correct misspellings in SQL.

1. The SQL table used in this example is not available for download, but if you were
performing a similar query you’d first make sure to load the data in BigQuery.

2. Start your SQL query with the basic structure:

SELECT

FROM

WHERE

3. In the FROM clause, specify the table you're pulling data from after FROM. For
example, project-id.customer_data.customer_name

4. In the SELECT clause, specify the columns you want to return. In this example, you
want customer_id and first_name.

5. However, there is a misspelling in a customer’s first name.

i. To correct the misspelled name "Tnoy" to "Tony", use a CASE statement.

ii. Enter CASE. On the next line, enter WHEN first_name = 'Tnoy'THEN 'Tony'.
This tells SQL to replace any instances of Tnoy in the first_name column with Tony.

iii. On the next line, add the statement ELSE first_name to keep other names as
they are.

iv. End the statement with END AS cleaned_name.This creates a new column called
cleaned_name that will contain the data cleaned with the CASE statement.

6. Delete the WHERE clause because you don’t want to filter the query.

7. The final statement should be:

132
3

8. This SQL query will correct the misspelled name and leave other names unchanged
in a new column called cleaned_name. Note that this query corrects only the display of
the name; it does not update the table’s data.

Verification of data cleaning


en

Interactive Transcript - Enable basic transcript


mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

Hey there. In this video, we'll continue building on the verification process. As a quick
reminder, the goal is to ensure that our data-cleaning work was done properly and the results
can be counted on. You want your data to be verified so you know it's 100 percent ready to
go. It's like car companies running tons of tests to make sure a car is safe before it hits the
road. You learned that the first step in verification is returning to your original, unclean
dataset and comparing it to what you have now. This is an opportunity to search for common
problems.

133
After that, you clean up the problems manually. For example, by eliminating extra spaces or
removing an unwanted quotation mark. But there's also some great tools for fixing common
errors automatically, such as TRIM and remove duplicates. Earlier, you learned that TRIM is a
function that removes leading, trailing, and repeated spaces and data. Remove duplicates is a
tool that automatically searches for and eliminates duplicate entries from a spreadsheet. Now
sometimes you had an error that shows up repeatedly, and it can't be resolved with a quick
manual edit or a tool that fixes the problem automatically. In these cases, it's helpful to create a
pivot table.

A pivot table is a data summarization tool that is used in data processing. Pivot tables sort,
reorganize, group, count, total or average data stored in a database. We'll practice that now
using the spreadsheet from a party supply store. Let's say this company was interested in
learning which of its four suppliers is most cost-effective. An analyst pulled this data on the
products the business sells, how many were purchased, which supplier provides them, the cost
of the products, and the ultimate revenue. The data has been cleaned. But during verification,
we noticed that one of the suppliers' names was keyed in incorrectly.

We could just correct the word as "plus," but this might not solve the problem because we don't
know if this was a one-time occurrence or if the problem's repeated throughout the
spreadsheet. There are two ways to answer that question. The first is using Find and
replace. Find and replace is a tool that looks for a specified search term in a spreadsheet and
allows you to replace it with something else. We'll choose Edit. Then Find and replace. We're
trying to find P-L-O-S, the misspelling of "plus" in the supplier's name.

In some cases you might not want to replace the data. You just want to find something. No
problem. Just type the search term, leave the rest of the options as default and click "Done." But
right now we do want to replace it with P-L-U-S. We'll type that in here. Then click "Replace all"
and "Done."

There we go. Our misspelling has been corrected. That was of course the goal. But for now let's
undo our Find and replace so we can practice another way to determine if errors are repeated
throughout a dataset, like with the pivot table. We'll begin by selecting the data we want to
use. Choose column C. Select "Data." Then "Pivot Table."

Choose "New Sheet" and "Create." We know this company has four suppliers. If we count the
suppliers and the number doesn't equal four, we know there's a problem. First, add a row for

134
suppliers. Next, we'll add a value for our suppliers and summarize by COUNTA. COUNTA
counts the total number of values within a specified range. Here we're counting the number of
times a supplier's name appears in column C. Note that there's also function called
COUNT, which only counts the numerical values within a specified range.

If we use it here, the result would be zero. Not what we have in mind. But in other special
applications, COUNT would give us information we want for our current example. As you
continue learning more about formulas and functions, you'll discover more interesting options. If
you want to keep learning, search online for spreadsheet formulas and functions. There's a lot
of great information out there. Our pivot table has counted the number of misspellings, and it
clearly shows that the error occurs just once.

Otherwise our four suppliers are accurately accounted for in our data. Now we can correct the
spelling, and we verify that the rest of the supplier data is clean. This is also useful practice
when querying a database. If you're working in SQL, you can address misspellings using a
CASE statement. The CASE statement goes through one or more conditions and returns a
value as soon as a condition is met. Let's discuss how this works in real life using our
customer_name table. Check out how our customer, Tony Magnolia, shows up as Tony and
Tnoy.

Tony's name was misspelled. Let's say we want a list of our customer IDs and the customer's
first names so we can write personalized notes thanking each customer for their purchase. We
don't want Tony's note to be addressed incorrectly to "Tnoy." Here's where we can use: the
CASE statement. We'll start our query with the basic SQL structure. SELECT, FROM, and
WHERE. We know that data comes from the customer_name table in the customer_data
dataset, so we can add customer underscore data dot customer underscore name after FROM.

Next, we tell SQL what data to pull in the SELECT clause. We want customer_id and
first_name. We can go ahead and add customer underscore ID after SELECT. But for our
customer's first names, we know that Tony was misspelled, so we'll correct that using CASE.
We'll add CASE and then WHEN and type first underscore name equal "Tnoy." Next we'll use
the THEN command and type "Tony," followed by the ELSE command. Here we will type first
underscore name, followed by End As and then we'll type cleaned underscore name.

Finally, we're not filtering our data, so we can eliminate the WHERE clause. As I mentioned, a
CASE statement can cover multiple cases. If we wanted to search for a few more misspelled

135
names, our statement would look similar to the original, with some additional names like
this. There you go. Now that you've learned how you can use spreadsheets and SQL to fix
errors automatically, we'll explore how to keep track of our changes next.

Data-cleaning verification
checklist
This reading will give you a checklist of common problems you can refer to when doing
your data cleaning verification, no matter what tool you are using. When it comes to
data cleaning verification, there is no one-size-fits-all approach or a single checklist that
can be universally applied to all projects. Each project has its own organization and data
requirements that lead to a unique list of things to run through for verification.

Keep in mind, as you receive more data or a better understanding of the project goal(s),
you might want to revisit some or all of these steps.

Correct the most common problems

Make sure you identified the most common problems and corrected them, including:

 Sources of errors: Did you use the right tools and functions to find the source of the
errors in your dataset?
 Null data: Did you search for NULLs using conditional formatting and filters?
 Misspelled words: Did you locate all misspellings?
 Mistyped numbers: Did you double-check that your numeric data has been entered
correctly?
136
 Extra spaces and characters: Did you remove any extra spaces or characters
using the TRIM function?
 Duplicates: Did you remove duplicates in spreadsheets using the Remove
Duplicates function or DISTINCT in SQL?
 Mismatched data types: Did you check that numeric, date, and string data are
typecast correctly?
 Messy (inconsistent) strings: Did you make sure that all of your strings are
consistent and meaningful?
 Messy (inconsistent) date formats: Did you format the dates consistently
throughout your dataset?
 Misleading variable labels (columns): Did you name your columns
meaningfully?
 Truncated data: Did you check for truncated or missing data that needs correction?
 Business Logic: Did you check that the data makes sense given your knowledge of
the business?

Review the goal of your project

Once you have finished these data cleaning tasks, it is a good idea to review the goal of
your project and confirm that your data is still aligned with that goal. This is a continuous
process that you will do throughout your project-- but here are three steps you can keep
in mind while thinking about this:

 Confirm the business problem


 Confirm the goal of the project
 Verify that data can solve the problem and is aligned to the goal

137
Capture cleaning changes
en

Interactive Transcript - Enable basic transcript


mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

Hi again. Now that you've learned how to make your data squeaky clean, it's time to address all
the dirt you've left behind. When you clean your data, all the incorrect or outdated information
is gone, leaving you with the highest-quality content. But all those changes you made to the
data are valuable too. In this video, we'll discuss why keeping track of changes is important to
every data project and how to document all your cleaning changes to make sure everyone stays

138
informed. This involves documentation which is the process of tracking changes, additions,
deletions and errors involved in your data cleaning effort. You can think of it like a crime TV
show.

Crime evidence is found at the scene and passed on to the forensics team. They analyze every
inch of the scene and document every step, so they can tell a story with the evidence. A lot of
times, the forensic scientist is called to court to testify about that evidence, and they have a
detailed report to refer to. The same thing applies to data cleaning. Data errors are the
crime, data cleaning is gathering evidence, and documentation is detailing exactly what
happened for peer review or court. Having a record of how a data set evolved does three very
important things. First, it lets us recover data-cleaning errors.

Instead of scratching our heads, trying to remember what we might have done three months
ago, we have a cheat sheet to rely on if we come across the same errors again later. It's also a
good idea to create a clean table rather than overriding your existing table. This way, you still
have the original data in case you need to redo the cleaning. Second, documentation gives you
a way to inform other users of changes you've made. If you ever go on vacation or get
promoted, the analyst who takes over for you will have a reference sheet to check in with. Third,
documentation helps you to determine the quality of the data to be used in analysis. The first
two benefits assume the errors aren't fixable.

But if they are, a record gives the data engineer more information to refer to. It's also a great
warning for ourselves that the data set is full of errors and should be avoided in the future. If the
errors were time-consuming to fix, it might be better to check out alternative data sets that we
can use instead. Data analysts usually use a changelog to access this information. As a
reminder, a changelog is a file containing a chronologically ordered list of modifications made to
a project. You can use and view a changelog in spreadsheets and SQL to achieve similar
results. Let's start with the spreadsheet.

We can use Sheet's version history, which provides a real-time tracker of all the changes and
who made them from individual cells to the entire worksheet. To find this feature, click the File
tab, and then select Version history. In the right panel, choose an earlier version. We can find
who edited the file and the changes they made in the column next to their name. To return to
the current version, go to the top left and click "Back." If you want to check out changes in a

139
specific cell, we can right-click and select Show Edit History. Also, if you want others to be able
to browse a sheet's version history, you'll need to assign permission.

Now let's switch gears and talk about SQL. The way you create and view a changelog with SQL
depends on the software program you're using. Some companies even have their own separate
software that keeps track of changelogs and important SQL queries. This gets pretty
advanced. Essentially, all you have to do is specify exactly what you did and why when you
commit a query to the repository as a new and improved query. This allows the company to
revert back to a previous version if something you've done crashes the system, which has
happened to me before. Another option is to just add comments as you go while you're cleaning
data in SQL.

This will help you construct your changelog after the fact. For now, we'll check out query history,
which tracks all the queries you've run. You can click on any of them to revert back to a
previous version of your query or to bring up an older version to find what you've changed.
Here's what we've got. I'm in the Query history tab. Listed on the bottom right are all the queries
that run by date and time. You can click on this icon to the right of each individual query to bring
it up to the Query editor.

Changelogs like these are a great way to keep yourself on track. It also lets your team get real-
time updates when they want them. But there's another way to keep the communication flowing,
and that's reporting. Stick around, and you'll learn some easy ways to share your documentation
and maybe impress your stakeholders in the process. See you in the next video.

Embrace changelogs
What do engineers, writers, and data analysts have in common? Change.

Engineers use engineering change orders (ECOs) to keep track of new product
design details and proposed changes to existing products. Writers use document
revision histories to keep track of changes to document flow and edits. And data
analysts use changelogs to keep track of data transformation and cleaning. Here are
some examples of these:

140
Automated version control takes you most of the way

Most software applications have a kind of history tracking built in. For example, in
Google sheets, you can check the version history of an entire sheet or an individual cell
and go back to an earlier version. In Microsoft Excel, you can use a feature called
Track Changes. And in BigQuery, you can view the history to check what has
changed.

Here’s how it works:

Google 1. Right-click the cell and select Show edit history. 2. Click the left-arrow < or r

141
Sheets arrow > to move backward and forward in the history as needed.

Microsoft 1. If Track Changes has been enabled for the spreadsheet: click Review. 2. Unde
Excel Track Changes, click the Accept/Reject Changes option to accept or rejec
any change made.

BigQuery Bring up a previous version (without reverting to it) and figure out what changed by
comparing it to the current version.

Changelogs take you down the last mile

A changelog can build on your automated version history by giving you an even more
detailed record of your work. This is where data analysts record all the changes they
make to the data. Here is another way of looking at it. Version histories record what was
done in a data change for a project, but don't tell us why. Changelogs are super useful
for helping us understand the reasons changes have been made. Changelogs have no
set format and you can even make your entries in a blank document. But if you are
using a shared changelog, it is best to agree with other data analysts on the format of all
your log entries.

Typically, a changelog records:

 Data, file, formula, query, or any other component that changed


 Description of what changed
 Date of the change
 Person who made the change
 Person who approved the change
 Version number
 Reason for the change

Let’s say you made a change to a formula in a spreadsheet because you observed it in
another report and you wanted your data to match and be consistent. If you found out
later that the report was actually using the wrong formula, an automated version history
would help you undo the change. But if you also recorded the reason for the change in
a changelog, you could go back to the creators of the report and let them know about
the incorrect formula. If the change happened a while ago, you might not remember

142
who to follow up with. Fortunately, your changelog would have that information ready for
you! By following up, you would ensure data integrity outside your project. You would
also be showing personal integrity as someone who can be trusted with data. That is
the power of a changelog!

Finally, a changelog is important for when lots of changes to a spreadsheet or query


have been made. Imagine an analyst made four changes and the change they want to
revert to is change #2. Instead of clicking the undo feature three times to undo change
#2 (and losing changes #3 and #4), the analyst can undo just change #2 and keep all
the other changes. Now, our example was for just 4 changes, but try to think about how
important that changelog would be if there were hundreds of changes to keep track of.

143
Bonus tip

If an analyst is making changes to an existing SQL query that is shared across the
company, the company most likely uses what is called a version control system.
An example might be a query that pulls daily revenue to build a dashboard for senior
management.

Here's how a version control system affects a change to a query:

1. A company has official versions of important queries in their version control


system.

144
2. An analyst makes sure the most up-to-date version of the query is the one they will
change. This is called syncing
3. The analyst makes a change to the query.
4. The analyst might ask someone to review this change. This is called a code review
and can be informally or formally done. An informal review could be as simple as asking
a senior analyst to take a look at the change.
5. After a reviewer approves the change, the analyst submits the updated version of the
query to a repository in the company's version control system. This is called a code
commit. A best practice is to document exactly what the change was and why it was
made in a comments area. Going back to our example of a query that pulls daily
revenue, a comment might be: Updated revenue to include revenue coming from the
new product, Calypso.
6. After the change is submitted, everyone else in the company will be able to access
and use this new query when they sync to the most up-to-date queries stored in the
version control system.
7. If the query has a problem or business needs change, the analyst can undo the
change to the query using the version control system. The analyst can look at a
chronological list of all changes made to the query and who made each change. Then,
after finding their own change, the analyst can revert to the previous version.
8. The query is back to what it was before the analyst made the change. And everyone at
the company sees this reverted, original query, too.

Key takeaways

Engineers, writers, and data analysts use different methods to keep track of changes
they make to their work. Automated version control, changelogs, and version control
systems are all common tools used to track changes. Changelogs are particularly
useful, as they can be used to record the reasons for changes made to data. This can
help to ensure data integrity and consistency. Version control systems are most
commonly used when making changes to shared queries. They enable analysts to track
any changes made and revert to previous versions if necessary.

Why documentation is important


en

145
Interactive Transcript - Enable basic transcript
mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text press
CTRL + S. To expand your selection you may use CTRL + arrow key. You may contract your
selection using shift + CTRL + arrow key. For screen readers that are incompatible with using
arrow keys for shortcuts, you can replace them with the H J K L keys. Some screen readers may
require using CTRL in conjunction with the alt key

Great, you're back. Let's set the stage. The crime is dirty data. We've gathered the evidence. It's
been cleaned, verified, and cleaned again. Now it's time to present our evidence. We'll retrace
the steps and present our case to our peers.

As we discussed earlier, data cleaning, verifying, and reporting is a lot like crime drama. Now it's
our day in court. Just like a forensic scientist testifies on the stand about the evidence, data
analysts are counted on to present their findings after a data cleaning effort. Earlier, we learned
how to document and track every step of the data cleaning process, which means we have solid
information to pull from. As a quick refresher, documentation is the process of tracking changes,
additions, deletions, and errors involved in a data cleaning effort, changelogs are good example
of this. Since it's staged chronologically, it provides a real-time account of every
modification. Documenting will be a huge time saver for you as a future data analyst.

It's basically a cheatsheet you can refer to if you're working with the similar data set or need to
address similar errors. While your team can view changelogs directly, stakeholders can't and
have to rely on your report to know what you did. Lets check out how we might document our
data cleaning process using example we worked with earlier. In that example, we found that this
association had two instances of the same membership for $500 in its database. We decided to
fix this manually by deleting the duplicate info. There're plenty of ways we could go about
documenting what we did. One common way is to just create a doc listing out the steps we took
and the impact they had.

For example, first on your list would be that you remove the duplicate instance, which
decreased the number of rows from 33 to 32, and lowered the membership total by $500. If we
were working with SQL, we could include a comment in the statement describing the reason
for a change without affecting the execution of the statement. That's something a bit more
advanced, which we'll talk about later. Regardless of how we capture and share our

146
changelogs, we're setting ourselves up for success by being 100 percent transparent about our
data cleaning. This keeps everyone on the same page and shows project stakeholders that we
are accountable for effective processes. In other words, this helps build our credibility as
witnesses who can be trusted to present all the evidence accurately during testimony. For dirty
data, it's an open and shut case.
Why documentation is importantInteractive Transcript - Enable basic transcript mode by pressing the
escape key You may navigate through the transcript using tab. To save a note for a section of text press
CTRL + S. To expand your selection you may use CTRL + arrow key. You may contract your selection using
shift + CTRL + arrow key. For screen readers that are incompatible with using arrow keys for shortcuts,
you can replace them with the H J K L keys. Some screen readers may require using CTRL in conjunction
with the alt key Great, you're back. Let's set the stage. The crime is dirty data. We've gathered the
evidence. It's been cleaned, verified, and cleaned again. Now it's time to present our evidence. We'll
retrace the steps and present our case to our peers. As we discussed earlier, data cleaning, verifying,
and reporting is a lot like crime drama. Now it's our day in court. Just like a forensic scientist testifies on
the stand about the evidence, data analysts are counted on to present their findings after a data
cleaning effort. Earlier, we learned how to document and track every step of the data cleaning
process, which means we have solid information to pull from. As a quick refresher, documentation is the
process of tracking changes, additions, deletions, and errors involved in a data cleaning
effort, changelogs are good example of this. Since it's staged chronologically, it provides a real-time
account of every modification. Documenting will be a huge time saver for you as a future data analyst.
It's basically a cheatsheet you can refer to if you're working with the similar data set or need to address
similar errors. While your team can view changelogs directly, stakeholders can't and have to rely on your
report to know what you did. Lets check out how we might document our data cleaning process using
example we worked with earlier. In that example, we found that this association had two instances of
the same membership for $500 in its database. We decided to fix this manually by deleting the duplicate
info. There're plenty of ways we could go about documenting what we did. One common way is to just
create a doc listing out the steps we took and the impact they had. For example, first on your list would
be that you remove the duplicate instance, which decreased the number of rows from 33 to 32, and
lowered the membership total by $500. If we were working with SQL, we could include a comment
in the statement describing the reason for a change without affecting the execution of the
statement. That's something a bit more advanced, which we'll talk about later. Regardless of how we
capture and share our changelogs, we're setting ourselves up for success by being 100 percent
transparent about our data cleaning. This keeps everyone on the same page and shows project
stakeholders that we are accountable for effective processes. In other words, this helps build our
credibility as witnesses who can be trusted to present all the evidence accurately during testimony. For
dirty data, it's an open and shut case. Process Data from Dirty to Clean Introduction to data integrity
Video. Duration: 3 minutes3 min Course 4 overview Reading. Duration: 8 minutes8 min Helpful
resources and tips Reading. Duration: 4 minutes4 min Why data integrity is important Video. Duration: 2
minutes2 min More about data integrity and compliance Reading. Duration: 8 minutes8 min Balance
objectives with data integrity Video. Duration: 3 minutes3 min Well-aligned objectives and data Reading.
Duration: 8 minutes8 min Test your knowledge on data integrity and analytics objectives Practice
AssignmentGrade: 100% Deal with insufficient data Video. Duration: 3 minutes3 min When you find an
issue with your data Reading. Duration: 4 minutes4 min The importance of sample size Video. Duration:
3 minutes3 min Calculate sample size Reading. Duration: 8 minutes8 min Self-Reflection: Pre-cleaning
activities Practice AssignmentGrade: 100% Test your knowledge on insufficient data Practice

147
AssignmentGrade: 100% Using statistical power Video. Duration: 4 minutes4 min When data isn't readily
available Reading. Duration: 8 minutes8 min Determine the best sample size Video. Duration: 4
minutes4 min Sample size calculator Reading. Duration: 8 minutes8 min Test your knowledge on testing
your data Practice AssignmentGrade: 100% Evaluate data reliability Video. Duration: 5 minutes5 min All
about margin of error Reading. Duration: 8 minutes8 min Test your knowledge on margin of error
Practice AssignmentGrade: 100% Glossary terms from module 1 Reading. Duration: 4 minutes4 min
Module 1 challenge Graded AssignmentGrade: 100% Clean it up! Video. Duration: 2 minutes2 min Why
data cleaning is critical Video. Duration: 5 minutes5 min Angie: I love cleaning data Video. Duration: 1
minute1 min What is dirty data? Reading. Duration: 8 minutes8 min Recognize and remedy dirty data
Video. Duration: 5 minutes5 min Principles of data integrity Ungraded Plugin. Duration: 10 minutes10
min Test your knowledge on data cleaning Practice AssignmentGrade: 100% Data-cleaning tools and
techniques Video. Duration: 6 minutes6 min Clean data from multiple sources Video. Duration: 5
minutes5 min Common data-cleaning pitfalls Reading. Duration: 8 minutes8 min Hands-On Activity:
Cleaning data with spreadsheets Practice AssignmentGrade: 100% Test your knowledge on the first
steps toward clean data Practice Assignment. Duration: 8 minutes8 min Step-by-Step guide: Data-
cleaning features in spreadsheets Reading. Duration: 8 minutes8 min Data-cleaning features in
spreadsheets Video. Duration: 8 minutes8 min Step-by-Step: Optimize the data-cleaning process
Reading. Duration: 8 minutes8 min Optimize the data-cleaning process Video. Duration: 14 minutes14
min Workflow automation Reading. Duration: 8 minutes8 min Step-by-Step: Different data perspectives
Reading. Duration: 8 minutes8 min Different data perspectives Video. Duration: 10 minutes10 min Step-
by-Step: Even more data-cleaning techniques Reading. Duration: 8 minutes8 min Even more data-
cleaning techniques Video. Duration: 6 minutes6 min Working with .csv files Reading. Duration: 4
minutes4 min Hands-On Activity: Clean data with spreadsheet functions Practice Assignment. Duration:
1 hour1h Develop your approach to cleaning data Reading. Duration: 8 minutes8 min Test your
knowledge on cleaning data in spreadsheets Practice Assignment. Duration: 8 minutes8 min Glossary
terms from module 2 Reading. Duration: 4 minutes4 min Module 2 challenge Graded Assignment.
Duration: 40 minutes40 min Use SQL to clean data Video. Duration: 45 seconds45 sec Sally: For the love
of SQL Video. Duration: 3 minutes3 min Understand SQL capabilities Video. Duration: 3 minutes3 min
How a junior data analyst uses SQL Reading. Duration: 4 minutes4 min Spreadsheets versus SQL Video.
Duration: 4 minutes4 min SQL dialects and their uses Reading. Duration: 8 minutes8 min Review: Set up
your BigQuery account Reading. Duration: 8 minutes8 min Review: Get started with BigQuery Reading.
Duration: 8 minutes8 min Hands-On Activity: Processing time with SQL Practice Assignment. Duration: 1
hour1h Optional: Upload the customer dataset to BigQuery Reading. Duration: 4 minutes4 min Widely
used SQL queries Video. Duration: 6 minutes6 min Evan: Having fun with SQL Video. Duration: 2
minutes2 min Clean string variables using SQL Video. Duration: 12 minutes12 min Hands-On Activity:
Clean data using SQL Practice Assignment. Duration: 1 hour1h Test your knowledge on SQL queries
Practice Assignment. Duration: 10 minutes10 min Optional: Upload the store transactions dataset to
BigQuery Reading. Duration: 8 minutes8 min Advanced data-cleaning functions, part 1 Video. Duration:
6 minutes6 min Advanced data-cleaning functions, part 2 Video. Duration: 8 minutes8 min Data-cleaning
with SQL functions Ungraded Plugin. Duration: 10 minutes10 min Self-Reflection: Challenges with SQL
Practice Assignment. Duration: 20 minutes20 min Glossary terms from module 3 Reading. Duration: 2
minutes2 min Module 3 challenge Graded Assignment. Duration: 45 minutes45 min Verify and report
results Video. Duration: 3 minutes3 min Confirm data-cleaning meets business expectations Video.
Duration: 5 minutes5 min Step-by-Step: Verification of data cleaning Reading. Duration: 8 minutes8 min
Verification of data cleaning Video. Duration: 8 minutes8 min Data-cleaning verification checklist
Reading. Duration: 4 minutes4 min Test your knowledge on manual data cleaning Practice Assignment.
Duration: 8 minutes8 min Capture cleaning changes Video. Duration: 5 minutes5 min Embrace
changelogs Reading. Duration: 8 minutes8 min Self-Reflection: Creating a changelog Practice

148
Assignment. Duration: 20 minutes20 min Why documentation is important Video. Duration: 3 minutes3
min Feedback and cleaning Video. Duration: 2 minutes2 min Advanced functions for speedy data
cleaning Reading. Duration: 4 minutes4 min Test your knowledge on documenting the cleaning process
Practice Assignment. Duration: 8 minutes8 min Glossary terms from module 4 Reading. Duration: 2
minutes2 min Module 4 challenge Graded Assignment. Duration: 40 minutes40 min Make your resume
unique Video. Duration: 3 minutes3 min Joseph: Black and African American inclusion in the data
industry Video. Duration: 1 minute1 min The importance of diversity on a data analytics team Reading.
Duration: 4 minutes4 min Add technical skills to your resume Reading. Duration: 8 minutes8 min Where
does your interest lie? Video. Duration: 3 minutes3 min Reflect and connect with peers Reading.
Duration: 4 minutes4 min Course 4 glossary Reading. Duration: 4 minutes4 min Congratulations! Course
wrap-up Video. Duration: 1 minute1 min Coming up next ... Reading. Duration: 4 minutes4 min 0:00 /
3:24 : Added to Selection. Press [CTRL + S] to save as a note : Added to Selection. Press [CTRL + S] to save
as a note

Feedback and cleaning


en

Interactive Transcript - Enable basic transcript


mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

Welcome back. By now it's safe to say that verifying, documenting and reporting are valuable
steps in the data-cleaning process. You have proof to give stakeholders that your data is
accurate and reliable. And the effort to attain it was well-executed and documented. The next
step is getting feedback about the evidence and using it for good, which we'll cover in this
video. Clean data is important to the task at hand. But the data-cleaning process itself can
reveal insights that are helpful to a business.

The feedback we get when we report on our cleaning can transform data collection processes,
and ultimately business development. For example, one of the biggest challenges of working
with data is dealing with errors. Some of the most common errors involve human mistakes like
mistyping or misspelling, flawed processes like poor design of a survey form, and system issues
where older systems integrate data incorrectly. Whatever the reason, data-cleaning can shine a
light on the nature and severity of error-generating processes. With consistent documentation

149
and reporting, we can uncover error patterns in data collection and entry procedures and use
the feedback we get to make sure common errors aren't repeated. Maybe we need to
reprogram the way the data is collected or change specific questions on the survey form. In
more extreme cases, the feedback we get can even send us back to the drawing board to
rethink expectations and possibly update quality control procedures.

For example, sometimes it's useful to schedule a meeting with a data engineer or data owner to
make sure the data is brought in properly and doesn't require constant cleaning. Once errors
have been identified and addressed, stakeholders have data they can trust for decision-
making. And by reducing errors and inefficiencies in data collection, the company just might
discover big increases to its bottom line. Congratulations! You now have the foundation you
need to successfully verify a report on your cleaning results. Stay tuned to keep building on your
new skills.

Advanced functions for speedy


data cleaning
In this reading, you will learn about some advanced functions that can help you speed
up the data cleaning process in spreadsheets. Below is a table summarizing three
functions and what they do:

150
Keeping data clean and in sync with a source

The IMPORTRANGE function in Google Sheets and the Paste Link feature (a
Paste Special option in Microsoft Excel) both allow you to insert data from one sheet to
another. Using these on a large amount of data is more efficient than manual copying
and pasting. They also reduce the chance of errors being introduced by copying and
pasting the wrong data. They are also helpful for data cleaning because you can “cherry
pick” the data you want to analyze and leave behind the data that isn’t relevant to your
project. Basically, it is like canceling noise from your data so you can focus on what is
most important to solve your problem. This functionality is also useful for day-to-day
data monitoring; with it, you can build a tracking spreadsheet to share the relevant data

151
with others. The data is synced with the data source so when the data is updated in the
source file, the tracked data is also refreshed.

In Google Sheets, you can use the IMPORTRANGE function. It enables you to specify a
range of cells in the other spreadsheet to duplicate in the spreadsheet you are working
in. You must allow access to the spreadsheet containing the data the first time you
import the data.

The URL shown below is for syntax purposes only. Don't enter it in
your own spreadsheet. Replace it with a URL to a spreadsheet you
have created so you can control access to it by clicking the Allow
access button.

Refer to the Google support page for IMPORTRANGE for the sample usage and
syntax.

Example of using IMPORTRANGE

An analyst monitoring a fundraiser needs to track and ensure that matching funds are
distributed. They use IMPORTRANGE to pull all the matching transactions into a
spreadsheet containing all of the individual donations. This enables them to determine
which donations eligible for matching funds still need to be processed. Because the total
number of matching transactions increases daily, they simply need to change the range
used by the function to import the most up-to-date data.

On Tuesday, they use the following to import the donor names and matched amounts:

152
=IMPORTRANGE("[Link]
abcd123abcd123", "sheet1!A1:C10", "Matched Funds!A1:B4001")

On Wednesday, another 500 transactions were processed. They increase the range
used by 500 to easily include the latest transactions when importing the data to the
individual donor spreadsheet:

=IMPORTRANGE("[Link]
abcd123abcd123", "Matched Funds!A1:B4501")

Note: The above examples are for illustrative purposes only. Don't
copy and paste them into your spreadsheet. To try it out yourself,
you will need to substitute your own URL (and sheet name if you
have multiple tabs) along with the range of cells in the spreadsheet
that you have populated with data.

Pulling data from other data sources

The QUERY function is also useful when you want to pull data from another
spreadsheet. The QUERY function's SQL-like ability can extract specific data within a
spreadsheet. For a large amount of data, using the QUERY function is faster than filtering
data manually. This is especially true when repeated filtering is required. For example,
you could generate a list of all customers who bought your company’s products in a
particular month using manual filtering. But if you also want to figure out customer
growth month over month, you have to copy the filtered data to a new spreadsheet, filter
the data for sales during the following month, and then copy those results for the
analysis. With the QUERY function, you can get all the data for both months without a
need to change your original dataset or copy results.

The QUERY function syntax is similar to IMPORTRANGE. You enter the sheet by name
and the range of data that you want to query from, and then use the SQL SELECT
command to select the specific columns. You can also add specific criteria after the
SELECT statement by including a WHERE statement. But remember, all of the SQL code
you use has to be placed between the quotes!

Google Sheets run the Google Visualization API Query Language across the data.
Excel spreadsheets use a query wizard to guide you through the steps to connect to a

153
data source and select the tables. In either case, you are able to be sure that the data
imported is verified and clean based on the criteria in the query.

Examples of using QUERY

Check out the Google support page for the QUERY function with sample usage, syntax,
and examples you can download in a Google sheet.

Link to make a copy of the sheet: QUERY examples

The solution

Analysts can use SQL to pull a specific dataset into a spreadsheet. They can then use
the QUERY function to create multiple tabs (views) of that dataset. For example, one tab
could contain all the sales data for a particular month and another tab could contain all
the sales data from a specific region. This solution illustrates how SQL and
spreadsheets are used well together.

Filtering data to get what you want

The FILTER function is fully internal to a spreadsheet and doesn’t require the use of a
query language. The FILTER function lets you view only the rows (or columns) in the
source data that meet your specified conditions. It makes it possible to pre-filter data
before you analyze it.

The FILTER function might run faster than the QUERY function. But keep in mind, the
QUERY function can be combined with other functions for more complex calculations.
For example, the QUERY function can be used with other functions like SUM and COUNT
to summarize data, but the FILTER function can't.

Example of using FILTER

Check out the Google support page for the FILTER function with sample usage, syntax,
and examples you can download in a Google sheet.

Link to make a copy of the sheet: FILTER examples

154
Glossary terms from module 4
Terms and definitions for Course 4, Module 4

CASE: A SQL statement that returns records that meet conditions by including an
if/then statement in a query

Changelog: A file containing a chronologically ordered list of modifications made to a


project

COUNTA: A spreadsheet function that counts the total number of values within a
specified range

Find and replace: A tool that finds a specified search term and replaces it with
something else

Verification: A process to confirm that a data-cleaning effort was well executed and
the resulting data is accurate and reliable

MODULE 05
Make your resume unique
Great to see you again. Building a strong resume is a great way to find success in your job
hunt. You've had the chance to start building your resume, and now we'll take the next step by
showing you how to refine your resume for data analytics jobs. Let's get started. For data
analytics, one of the most important things your resume should do is show that you are a clear
communicator. Companies looking for analysts want to know that the people they hire can do
the analysis, but also can explain it to any audience in a clear and direct way. Your first
audience as a data analyst will most likely be hiring managers and recruiters.

Being direct and coherent in your resume will go a long way with them as well. Let's start with
the summary section. While you won't go into too much detail in this section about any of your
work experiences, it's a good spot to point out if you're transitioning into a new career role. You
might add something like, "transitioning from a career in the auto industry and seeking a full-
time role in the field of data analytics." One strategy you can use in your summary and

155
throughout your resume is P-A-R, or PAR statements. PAR stands for Problem, Action,
Result. This is a great way to help you write clearly and concisely.

Instead of saying something like, "was responsible for writing two blogs a month," you'd say,
"earned little-known website over 2,000 new clicks through strategic blogging." The website
being little-known is the problem. The strategic action is the strategic blogging. And the result is
the 2,000 new clicks. Adding PAR statements to your job descriptions or skill section can help
with the organization and consistency in your resume. They definitely helped me when I
changed jobs. Speaking of the skill section, make sure you include any skills and
qualifications you've acquired through this course and on your own.

You don't need to be super technical. But talking about your experience with
spreadsheets, SQL, Tableau, and R, which is a programming language that we'll get to
later, will enhance your resume and your chances of getting a job. If you're listing qualifications
or skills, you might include a spot for programming languages and then list SQL and R, which
are both a part of the Google Data Analytics certificate. You might even add in the top
functions, packages or formulas that you're comfortable with in each. It also makes sense to
include skills you've acquired in spreadsheets like pivot tables. Pivot tables, SQL, R, and lots of
other terms we covered here might get you noticed by hiring managers and recruiters. But you
definitely want your resume to accurately represent your skills and abilities.

Only add these skills after you've completed this certificate. Once you start applying the ideas
we talked about here to your resume, you'll be well on your way to setting yourself apart from
other candidates. After you've completed your final course, you'll have the opportunity to
complete a case study and link it on your resume. This'll be a great opportunity to show
recruiters and hiring managers the skills you've learned while earning your certificate. Before
you know it, you'll have a pretty great resume that you can update quickly whenever
you're searching for a data analyst job. Nothing wrong with that. Up next, we'll talk more about
adding experienced to your resume. Bye for now.

Joseph: Black and African American


inclusion in the data industry
Hello, my name is Joseph. I'm a people analyst at Google. As a people analyst, my job is
to work with executives and HR business partners to use data to make informed people

156
decisions. Inclusion is very essential to the work that we do. As you know, sometimes you can
start with data and have your own bias in it. For us in this field that is very sensitive, it requires
that we have a diverse set of people who have different backgrounds to have this lens of data to
work. Being a black professional, I can better tell a story about people of color that is a lot more
personal to me.

Being an analyst requires me to take data and tell a story with it. On a personal standpoint, I'm
very passionate about this space of increasing representation in the tech industry. For example,
outside of work, I run a nonprofit called Sankofa Tech. Our whole goal is essentially to help
develop the next generation of black engineers who can essentially be in this field and represent
our experience using data as a foundation and offer technology as the powering moving factor
going forward. It's critical that we have more black people in the technology sector. As you all
know, in the next 10-20 years, AI, machine learning, will be like just speaking English in this
country or even the entire world. So the more we can have more black people in this field, the
more we can represent it in the products that are being built, and the more that our experiences
are being influenced in every single product that these companies do build.

It's definitely critical that we have more black engineers, we have more black data scientists to
do the analysis, and also just black data analysts to help tell the story that's more inclusive of
our experience as well. It's definitely essential that we do have people from different
backgrounds, colors, creeds to really understand data, and have the alliance to it, and tell the
story, and make it very personal to our audience.

The importance of diversity on a


data analytics team
As you’ve been learning, there are aspects of data analysis in which bias can play a
role. This is why diversity and inclusion are important in data teams and can often be
key to taking fair and accurate insights from the data. In this reading, you’re going to
explore some of the ways bias can influence data analysis, the ethical considerations of
working with data, and how you can mitigate bias.

157
Data collection and bias

Even before data is collected, bias might have been involved in determining what data
is collected and how. Biased data can lead to inaccurate insights, or it can reinforce
existing inequity. An example of this might be a bank that focuses its data collection
efforts on predominantly affluent areas. Perhaps residents there have a history of being
more profitable bank clients. However, this would exclude lower-income individuals from
the bank’s marketing strategies. A more diverse data team is more likely to notice gaps
in data collection and understand the impact on new loan initiatives. Plus, extending
loans to a broader base can help banks attract more customers, opening up new
markets and revenue streams.

Data interpretation and bias

Data analysts interpret data to gain insights and guide decision-making. For example,
an analyst in a healthcare organization might review data about patient outcomes. If the
analyst doesn’t understand or account for disparities in healthcare among various
populations, they may present an incomplete report on what the data means. With a
diverse group of data analysts, it’s more likely that someone will be familiar with these
disparities and factor in additional data to create a more complete picture. For example,
a diverse team analyzing healthcare data may notice that a particular study on heart
disease focuses primarily on male participants. Team members may recognize this
gender bias and recommend conducting a similar study with a more balanced
representation of all participants to understand sex-specific risk factors.

158
Ethical considerations

Data analysts often have access to a great deal of data and must consider what's right
and fair when collecting and interpreting it. It’s important to establish and follow ethical
guidelines for data use, some of which are included below:

 Informed consent: Obtain informed and voluntary consent from individuals before
collecting their data. This ensures that people are aware of how their data will be used
and gives them the opportunity to make an informed decision about data sharing.
 Anonymity and confidentiality: Protect the privacy of individuals through de-
identification and confidentiality measures. This guideline helps prevent unauthorized
access to sensitive data and maintains trust in data handling.
 Data security: Implement robust data security measures to safeguard data from
breaches and unauthorized access. Data security is foundational for protecting
individuals' information.
 Transparency: Transparency in data use practices is critical for building trust.
Clearly communicate how data is collected, processed, and shared.
 Data ownership and control: Make sure to give people control over their data,
including the ability to access, correct, or delete it.

If your team is small or lacks diversity, there are steps you can take to identify and
mitigate bias in data collection and analysis:

 Educate yourself on unconscious bias and its impact on data analysis.


 Review data collection methods to find and fix bias in tools and survey questions.
 Be transparent in documenting all aspects of your data collection and analysis methods
to allow others to review and identify potential biases in your methodology.
 Consult with diverse data experts and seek peer reviews from colleagues with diverse
perspectives.
 Engage in continuous learning and in discussions with other data analysts to stay up to
date on the latest research and techniques.

Key takeaways

Bias in data collection and interpretation can have wide-ranging impacts. Data analysts
can make a positive impact by practicing ethical data collection e and striving to engage
diverse voices in their work.
159
Add technical skills to your
resume
Technical skills are crucial when building a solid resume. They demonstrate to
employers that you have the skills necessary to successfully perform the job. Think of
technical skills as your toolbox: How you list each skill on your resume is how you
demonstrate to employers that you are capable of using those tools.

Common technical skills for entry-level data analysts

It takes lots of skills to be a successful data analyst, and these are some common ones
that employers seek out when hiring for data analyst jobs:

160
161
1. Structured Query Language (SQL): SQL is considered a basic skill that is
pivotal to any entry-level data analyst position. SQL helps you communicate with
databases, and more specifically, it is designed to help you retrieve information from
databases. Every month, thousands of data analyst jobs posted require SQL, and
knowing how to use SQL remains one of the most common job functions of a data
analyst.

2. Spreadsheets: Although SQL is popular, 62% of companies still prefer to use


spreadsheets for their data insights. When getting your first job as a data analyst, the
first version of your database might be in spreadsheet form, which is still a powerful tool
for reporting or even presenting data sets. So, it is important for you to be familiar with
using spreadsheets for your data insights.

3. Data visualization tools: Data visualization tools help to simplify complex data
and enable the data to be visually understood. After gathering and analyzing data, data
analysts are tasked with presenting their findings and making that information simple to
grasp. Common tools that are used in data analysis include Tableau, Microstrategy,
Data Studio, Looker, Datarama, Microsoft Power BI, and many more. Among these,
Tableau is best known for its ease of use, so it is a must-have for beginner data
analysts. Also, studies show that data analysis jobs requiring Tableau are expected to
grow about 34.9% over the next decade.

4. Python programming: Although you don’t need to be an expert in programming


languages as an entry-level data analyst, understanding the fundamentals of Python is
a big plus. Python is the most widely used programming language among data analysts,
and the Python skills you'll learn in a later course will be a valuable tool for data
analysis.

Note: R is also a great addition to have as you become more advanced in your
career.

Key takeaways

As a data analyst, you are often asked to collect and analyze data with a specific
purpose in mind. Knowing which platform and language to use helps you analyze the
data to decipher which information is important, to probe for any anomalies, prepare
questions, assess risks, and so much more.

162
Where does your interest lie?
en

Interactive Transcript - Enable basic transcript


mode by pressing the escape key
You may navigate through the transcript using tab. To save a note for a section of text
press CTRL + S. To expand your selection you may use CTRL + arrow key. You may
contract your selection using shift + CTRL + arrow key. For screen readers that are
incompatible with using arrow keys for shortcuts, you can replace them with the H J K L
keys. Some screen readers may require using CTRL in conjunction with the alt key

Hello. If you haven't done a search for a data analyst job yet, give it a try. One thing you might
notice is how many variations of data analysts jobs there are. You'll find some that just say "data
analyst" in the job title, and others that include more details like "market research analyst" and
"digital data analyst." This variety is a good thing. It means that as a data analyst, you'll have a
pretty wide range of job opportunities available. While you might not be the right fit for every
position that's posted, every position that's posted might not be the right fit for you.

As you continue moving forward, it's important to keep your own interests in mind. There might
be certain topics that we've covered or we'll cover that you find yourself especially interested
in. When you're job hunting, you might want to tailor your search to find jobs that are focused on
or include your areas of interest. For example, if a job description lists data cleaning as a job
responsibility and you think that you'd really enjoy that process, you could make that job your
top priority. At the same time, think about your other interests. If you have a background in retail
or medicine or finance and had a good experience with it, you might apply for jobs that match
your background. As an added bonus, your experience will look great on your resume.

But it's also okay to search for jobs in an area of personal interest where you have no
professional experience. If you've always loved cars, check out what positions the auto industry
has. If you're fascinated by how utility companies work, hunt for jobs in the energy and utilities
industry. Finding a job is great. Finding a job you love is even better. Always keep in mind that
data analytics is constantly evolving within lots of different industries. Job titles and hiring needs
might also change.

163
But the opportunities, no matter what they are when you're searching, will be there. Now let's
preview some of the many kinds of data analyst jobs that are out there. The certificate you earn
here will be most applicable to junior or associate data analyst positions. But that doesn't mean
you have to limit your job search to only postings for junior or associate analysts. Job titles
come in all shapes and sizes. New analysts work in a wide range of industries. Health care
analysts gather and interpret data from sources like electronic health records and patient
surveys.

Their work helps organizations improve the quality of their care. Health care analysts might also
look for ways to lower the cost of care and improve patient experience. Data analysts in
marketing complete quantitative and qualitative market analysis. They identify important
statistics and interpret and present their findings to help stakeholders understand the data
behind their marketing strategies. Business intelligence analysts help companies use data
they've collected to increase their efficiency and maximize their profits. These analysts usually
work with large amounts of data to identify trends and generate business insights. Financial
analysts also work with lots of data.

Really all analysts do. But financial analysts use the data to identify and potentially
recommend business and investment opportunities. If you're a junior analyst in this field, you'll
probably start off doing a lot of data gathering and financial modeling as well as spreadsheet
maintenance. This is just a small taste of the types of data analyst jobs out there. Each type
we've covered can branch out into other industries as well. For example, business intelligence
analysts can work in health care, government, e-commerce and more. It's exciting to think about
the possibilities.

There's more work for you to do of course, but there's nothing wrong with looking ahead. When
you get to that place you're looking ahead to, you'll be able to take charge and find the best job
for you. For the time being, we'll keep exploring your resume. See you soon.

164

You might also like