FDSA Unit1
Why Python?
Python is a great language for data science because it has many data science libraries
available, and it’s widely supported by specialized software.
Every popular NoSQL database has a Python-specific API.
Python lets you prototype quickly.
It offers good performance.
Python’s influence is steadily growing in the data science world.
As the amount of data continues to grow and the need to leverage it becomes more important,
every data scientist will come across big data projects throughout their career.
Data science and big data are used almost everywhere in both commercial and non-commercial
settings.
Here are some of the fields where Data Science and Big Data are widely used.
(v) Nongovernmental organizations
o Nongovernmental organizations (NGOs) use data to raise money and defend their
causes.
o The World Wildlife Fund (WWF), for instance, employs data scientists to increase
the effectiveness of its fundraising efforts.
o Many data scientists devote part of their time to helping NGOs, because NGOs
often lack the resources to collect data and employ data scientists.
DataKind is one such data scientist group that devotes its time to the benefit of
mankind.
(vi) Universities
Universities use data science in their research but also to enhance the study experience of
their students.
Massive open online courses (MOOCs) produce a lot of data, which allows universities to
study how this type of learning can complement traditional classes.
Examples of MOOC providers: Coursera, Udacity, and edX.
MOOCs allow you to stay up to date by following courses from top universities.
Figure 1.2 Email is an example of unstructured data and natural language data.
1.3.4 Machine-generated data
Machine-generated data is information that’s automatically created by a computer,
process, application, or other machine without human intervention.
Machine-generated data is becoming a major data resource and will continue to do so. Wikibon
has forecast that the market value of the industrial Internet (a term coined by
The analysis of machine data relies on highly scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call detail records, network event logs, and
telemetry (figure 1.3).
Machine data of this kind fits well in a classic table-structured database. This isn’t the best approach for highly
interconnected or “networked” data, where the relationships between entities have a valuable role to play.
Graph-based data is a natural way to represent social networks, and its structure allows you to
calculate specific metrics such as the influence of a person and the shortest path between two
people.
Examples of graph-based data can be found on many social media websites (figure 1.4).
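The two metrics mentioned above can be sketched on a tiny, made-up friendship graph. The names, the `friends` dictionary, and the use of the number of connections as a stand-in for "influence" are all assumptions for illustration; real social-network analysis typically uses richer measures.

```python
from collections import deque

# Hypothetical friendship graph: each person maps to the people they know.
friends = {
    "Ann": ["Bob", "Cara"],
    "Bob": ["Ann", "Dan"],
    "Cara": ["Ann", "Dan"],
    "Dan": ["Bob", "Cara", "Eve"],
    "Eve": ["Dan"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: return one shortest path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        person = path[-1]
        if person == goal:
            return path
        for neighbor in graph.get(person, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Crude "influence" proxy (an assumption): how many direct connections someone has.
degree = {person: len(known) for person, known in friends.items()}
most_connected = max(degree, key=degree.get)

path = shortest_path(friends, "Ann", "Eve")
print("Shortest path:", path)
print("Most connected:", most_connected)
```

Breadth-first search is used because every connection counts as one step, so the first path that reaches the goal is a shortest one.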
(1) Setting the research goal
Data science is mostly applied in the context of an organization. When
the business asks you to perform a data science project, you’ll first prepare a project
charter.
The charter contains information such as:
what you’re going to research
how the company benefits from that
what data and resources you need
timetable
deliverables
An Iterative Process
o The data science process is not always linear.
o In reality we often have to step back and rework certain findings.
o We might gain incremental insights, which may lead to new questions.
o To prevent rework, make sure that you scope the business question
clearly and thoroughly at the start.
1.5 SETTING THE RESEARCH GOAL (Step 1)
A project starts by understanding the following three questions:
What
Why
How
Answering these questions (what, why, how) is the goal of the first phase, so that everybody
knows what to do and can agree on the best course of action.
The outcome should be:
clear research goal
a good understanding of the context
well-defined deliverables
plan of action with a timetable
This information is then best placed in a project charter.
The length and formality can, of course, differ between projects and companies.
In this early phase of the project, people skills and business acumen are more
important than great technical prowess.
This phase is often guided by more senior personnel.
1.5.1 Spend time understanding the goals and context of your research
An essential outcome is the research goal, as it states the purpose of your
assignment in a clear and focused manner.
Understanding the business goals and context is critical for project success.
Continue asking questions and devising examples until you grasp the exact business
expectations.
Identify how your project fits in the bigger picture, and appreciate how your research is
going to change the business.
Understand how they’ll use your results.
Many data scientists fail here: despite their mathematical wit and scientific brilliance, they
never seem to grasp the business goals and context.
Data can be stored in many forms, ranging from simple text files to tables in a database.
The objective now is acquiring all the data you need.
Data is often like a diamond in the rough: it needs polishing to be of any use to you.
Table 2.1 A list of open-data providers that should get you started
1.7 CLEANSING, INTEGRATING, AND TRANSFORMING DATA (Step 3)
The data received from the data retrieval phase is likely to be “a diamond in the rough.”
The task now is to sanitize and prepare it for use in the modeling and reporting phase.
Doing so is tremendously important because models will perform better and we’ll lose less
time trying to fix strange output.
Your model needs the data in a specific format, so data transformation will always come into
play.
It’s a good habit to correct data errors as early in the process as possible. This might not
always be possible, but steps should be taken.
1.7.1 Cleansing data
Data cleansing is a subprocess of the data science process that focuses on removing errors in
your data so your data becomes a true and consistent representation of the processes it originates from.
By “true and consistent representation” we imply that at least two types of errors exist:
1. Interpretation error:
Such as when you take the value in your data for granted, like saying that a person’s age is
greater than 300 years.
2. Inconsistencies between data sources or against your company’s standardized values:
An example of this class of errors is putting “Female” in one table and “F” in another when they
represent the same thing: that the person is female.
More advanced methods, such as simple modeling, can also be used to find data errors. At the data cleansing stage, however, these methods are rarely applied and are
often regarded by certain data scientists as overkill.
ERRORS
I. DATA ENTRY ERRORS
Data collection and data entry are error-prone processes.
a) Human intervention
People can make typos or lose their concentration for a second and introduce an
error into the chain.
b) Machine/Computers
Data collected by machines or computers isn’t free from errors either.
Some errors arise from human sloppiness, whereas others are due to machine or
hardware failure.
Examples of errors originating from machines are transmission errors or bugs in the
extract, transform, and load phase.
For small data sets you can check every value by hand.
Detecting data errors when the variables you study don’t have many classes can be
done by tabulating the data with counts.
When a variable can take only two values, “Good” and “Bad”, you
can create a frequency table and see if those are truly the only two values present.
In table 2.3, the values “Godo” and “Bade” point out that something went wrong in at
least 16 cases.
Table 2.3 Detecting outliers on simple variables with a frequency table
Value    Count
Good     1598647
Bad      1354468
Godo     15
Bade     1
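A frequency table like table 2.3 can be sketched with the standard library. The short `values` list below is made-up sample data, not the counts from the table, and the `valid` set is an assumption about which classes are allowed.

```python
from collections import Counter

# Made-up sample; "Godo" and "Bade" are typos for the two valid classes.
values = ["Good", "Bad", "Good", "Godo", "Bad", "Good", "Bade", "Good"]
valid = {"Good", "Bad"}  # assumed list of allowed classes

# Tabulate every distinct value with its count.
counts = Counter(values)
for value, count in counts.most_common():
    flag = "" if value in valid else "  <-- unexpected"
    print(f"{value:<6}{count:>4}{flag}")

# Keep only the values that are not allowed, with how often they occur.
unexpected = {v: c for v, c in counts.items() if v not in valid}
```

Any entry left in `unexpected` points to rows that need correcting before analysis.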
II. REDUNDANT WHITESPACE
Whitespaces tend to be hard to detect but cause errors like other redundant characters
would.
For example, whitespace at the end of a string can cause a bug that is difficult to
find.
After looking for days through the code, you finally find the bug.
Then comes the hardest part: explaining the delay to the project stakeholders.
The cleaning during the ETL (extract, transform, and load) phase wasn’t well executed, and
keys in one table contained a whitespace at the end of a string.
This caused a mismatch of keys such as “FR ” – “FR”, dropping the observations that
couldn’t be matched.
If you know to watch out for them, fixing redundant whitespaces is luckily easy enough in
most programming languages.
They all provide string functions that will remove the leading and trailing whitespaces.
For instance, in Python you can use the strip() function to remove leading and trailing spaces.
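The key mismatch described above (“FR ” versus “FR”) can be reproduced in a few lines. The key lists here are invented for illustration.

```python
# Hypothetical keys from two tables; the left ones carry stray whitespace.
left_keys = ["FR ", "DE", " NL"]
right_keys = {"FR", "DE", "NL"}

# Without cleaning, only the exact matches survive the comparison.
matched_raw = [key for key in left_keys if key in right_keys]

# str.strip() removes leading and trailing whitespace before matching.
matched_clean = [key.strip() for key in left_keys if key.strip() in right_keys]

print(matched_raw)
print(matched_clean)
```

Stripping keys once, during the cleansing step, avoids silently dropping observations in every later join.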
IV. OUTLIERS
An outlier is an observation that seems to be distant from other observations or,
more specifically, one observation that follows a different logic or generative process
than the other observations.
Outliers may be exceptions that stand outside individual samples of populations.
An outlier is a data point that is noticeably different from the rest.
The easiest way to find outliers is to use a plot or a table with the minimum and
maximum values. An example is shown in figure 2.6.
The normal distribution, or Gaussian distribution, is the most common
distribution in natural sciences.
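As a rough sketch of the minimum and maximum check mentioned above, the snippet below summarizes a made-up list of ages and flags values outside a plausible range. The range of 0 to 120 is an assumption chosen for this example, not a rule from the text.

```python
# Made-up ages; 350 is an impossible value of the kind a min/max table exposes.
ages = [23, 35, 41, 29, 350, 31, 27]

print("min:", min(ages), "max:", max(ages))

# Assumed plausible range for a person's age.
lower, upper = 0, 120
suspects = [age for age in ages if not lower <= age <= upper]
print("suspect values:", suspects)
```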
o A code book also tells the type of data you’re looking at: is it hierarchical,
graph, something else?
If you have multiple values to check, it’s better to put them from the code book into a
table and use a difference operator to check the discrepancy between both tables.
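One way to read the "difference operator" above is a set difference between the values allowed by the code book and the values actually observed. Both sets below are hypothetical.

```python
# Hypothetical code book entry: the only allowed values for a gender column.
code_book_values = {"Female", "Male"}

# Hypothetical distinct values found in the data.
observed_values = {"Female", "Male", "F", "M"}

# Set difference: values present in the data but absent from the code book.
not_in_code_book = observed_values - code_book_values
print(sorted(not_in_code_book))
```

The same idea carries over to database tables, where the difference is taken between two queries instead of two Python sets.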
When you combine data, you have the option to create a new physical table or
a virtual table by creating a view.
The advantage of a view is that it doesn’t consume more disk space.
Joining tables
Joining tables allows you to combine the information of one observation found in one table
with the information that you find in another table.
To join tables, you use variables that represent the same object in both tables,
such as a date, a country name, or a Social Security number.
These common fields are known as keys.
When these keys also uniquely define the records in the table, they are called primary
keys.
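A join on a key can be sketched in plain Python. The two tables, their column names, and the choice of an inner join (keeping only rows whose key appears in both tables) are assumptions for illustration.

```python
# Hypothetical tables as lists of dicts; "client_id" is the shared key.
clients = [
    {"client_id": 1, "region": "North"},
    {"client_id": 2, "region": "South"},
]
purchases = [
    {"client_id": 1, "item": "soda", "price": 2.5},
    {"client_id": 1, "item": "chips", "price": 1.0},
    {"client_id": 3, "item": "juice", "price": 3.0},
]

# client_id is unique in clients, so it acts as a primary key there
# and can be used to build a lookup.
region_by_id = {c["client_id"]: c["region"] for c in clients}

# Inner join: enrich each purchase with the client's region, and drop
# purchases whose key has no match in the clients table.
joined = [
    {**p, "region": region_by_id[p["client_id"]]}
    for p in purchases
    if p["client_id"] in region_by_id
]

for row in joined:
    print(row)
```

The purchase by client 3 is dropped because no matching key exists, which is the same effect a mismatched key has in a database join.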
Appending Tables
Appending or stacking tables is effectively adding observations from one
table to another table.
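Appending can be sketched with Python's built-in sqlite3 module, which also shows the view option described earlier in these notes. The table names, columns, and rows are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_jan (client TEXT, amount REAL);
    CREATE TABLE sales_feb (client TEXT, amount REAL);
    INSERT INTO sales_jan VALUES ('A', 100), ('B', 50);
    INSERT INTO sales_feb VALUES ('A', 80), ('C', 20);

    -- Appending as a view: only the query is stored, not a copy of the rows.
    CREATE VIEW sales_all AS
        SELECT 'jan' AS month, client, amount FROM sales_jan
        UNION ALL
        SELECT 'feb' AS month, client, amount FROM sales_feb;
""")

# The view behaves like one stacked table holding the rows of both months.
rows = conn.execute(
    "SELECT month, client, amount FROM sales_all ORDER BY month, client"
).fetchall()
for row in rows:
    print(row)
conn.close()
```

UNION ALL keeps every row from both tables, which matches the idea of stacking observations; a plain UNION would also remove duplicate rows.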
ENRICHING AGGREGATED MEASURES
Data enrichment can also be done by adding calculated information to the table.
Extra measures, such as a total or a share of the total, can add perspective.
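As a small sketch of enrichment, the snippet below adds a calculated column to a made-up sales table. The product names, figures, and the choice of "share of total" as the extra measure are assumptions for illustration.

```python
# Hypothetical sales table.
sales = [
    {"product": "A", "sales": 120},
    {"product": "B", "sales": 60},
    {"product": "C", "sales": 20},
]

# Aggregated measure computed over the whole table.
total = sum(row["sales"] for row in sales)

# Enrich every row with its share of the total.
for row in sales:
    row["share_of_total"] = row["sales"] / total

for row in sales:
    print(f'{row["product"]}: {row["sales"]:>4}  {row["share_of_total"]:.0%}')
```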
Transforming Data
Certain models require their data to be in a certain shape.
o Now that the data has been cleansed and integrated, this is the next task we can
perform: transforming the data so it takes a suitable form for data modeling.
o Relationships between an input variable and an output variable aren’t always linear.
o Take, for instance, a relationship of the form y = ae^(bx).
o Taking the log of both sides gives ln(y) = ln(a) + bx, which is linear in x and simplifies the estimation problem dramatically.
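The effect of a log transformation on a relationship of the form y = ae^(bx) can be checked numerically. In this sketch the values a = 2.0 and b = 0.5 are assumed, the data is generated without noise, and a simple least-squares line is fitted to ln(y) against x.

```python
import math

# Assumed true parameters, used only to generate example data.
a_true, b_true = 2.0, 0.5
xs = [0, 1, 2, 3, 4, 5]
ys = [a_true * math.exp(b_true * x) for x in xs]

# Transform: ln(y) = ln(a) + b*x is a straight line in x.
log_ys = [math.log(y) for y in ys]

# Ordinary least squares for a straight line, written out by hand.
n = len(xs)
mean_x = sum(xs) / n
mean_log_y = sum(log_ys) / n
b_hat = sum((x - mean_x) * (ly - mean_log_y) for x, ly in zip(xs, log_ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a_hat = math.exp(mean_log_y - b_hat * mean_x)

print(f"a is about {a_hat:.3f}, b is about {b_hat:.3f}")
```

With noisy real data the recovered values would only approximate the true ones, but the estimation step stays a plain linear fit.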
o During exploratory data analysis you take a deep dive into the data.
o Information becomes much easier to grasp when shown in a picture, therefore you mainly use
graphical techniques to gain an understanding of your data and the interactions between variables.
o Some anomalies missed earlier may still surface here, forcing you to take a step back and fix them.
o The visualization techniques you use in this phase range from simple line graphs or
histograms to complex diagrams such as Sankey and network graphs.
o Sometimes it’s useful to compose a composite graph from simple graphs to get even
more insight into the data.
o Other times the graphs can be animated or made interactive to make exploration easier.
Bar Chart:
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular
bars with heights or lengths proportional to the values that they represent.
Multiple plots:
These plots can be combined to provide even more insight.
Histogram:
A variable is cut into discrete categories and the number of occurrences in each category
is summed up.
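The binning behind a histogram can be sketched without a plotting library. The ages and the bin width of 20 are made up for this example, and the bars are drawn as text.

```python
from collections import Counter

# Made-up observations of a numeric variable.
ages = [3, 17, 22, 25, 31, 38, 41, 47, 52, 68]
bin_width = 20  # assumed width of each category

# Map each value to the lower edge of its bin and count occurrences per bin.
bins = Counter((age // bin_width) * bin_width for age in ages)

# Draw one text bar per bin.
for start in sorted(bins):
    label = f"{start}-{start + bin_width - 1}"
    print(f"{label:<6} {'#' * bins[start]}")
```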
Boxplot:
The boxplot, on the other hand, doesn’t show how many observations are present but does
offer an impression of the distribution within categories. It can show the maximum, minimum,
median, and other characterizing measures at the same time.
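The measures a boxplot draws can be computed directly. This sketch uses `statistics.quantiles` from the standard library (Python 3.8 or later) on made-up values; other tools may use a slightly different quartile method, so the exact quartiles can differ.

```python
import statistics

# Made-up measurements; 45 is far from the rest.
values = [12, 15, 14, 10, 18, 20, 16, 13, 45]

# n=4 returns the three cut points that split the data into quartiles.
q1, median, q3 = statistics.quantiles(values, n=4)

summary = {
    "min": min(values),
    "q1": q1,
    "median": median,
    "q3": q3,
    "max": max(values),
}
for name, value in summary.items():
    print(f"{name:<7}{value}")
```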
Sankey Diagram:
A sankey diagram is a visualization used to depict a flow from one set of values to another.
The things being connected are called nodes and the connections are called links.
Network Graph:
Network diagrams (also called graphs) show interconnections between a set of entities.
Each entity is represented by a node (or vertex). Connections between nodes are represented
through links (or edges).
o The techniques used are borrowed from the field of machine learning, data mining,
and/or statistics.
o Building a model is an iterative process.
o The way you build your model depends on whether you go with classic statistics or
the somewhat more recent machine learning school, and the type of technique to be
used.
Either way, most models consist of the following main steps:
a. Model and variable selection
b. Model execution
c. Model diagnostics and model comparison
The [Link]() outputs the table.
■ Model fit :
For this the R-squared or adjusted R-squared is used.
This measure is an indication of the amount of variation in the
data that gets captured by the model.
The difference between the adjusted R-squared and the R-squared
is minimal here because the adjusted one is the normal one + a
penalty for model complexity.
A model gets complex when many variables (or features) are
introduced.
You don’t need a complex model if a simple model is available, so
the adjusted R-squared punishes you for overcomplicating.
For business models, an R-squared above 0.85 is often considered
good. Values in the high 0.90s are very good.
In research, however, very low model fits (even below 0.2) are often
found.
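The relation between R-squared and adjusted R-squared can be made concrete with the textbook formulas for a model with an intercept. The data below is invented, and a library may compute these values somewhat differently for other model setups.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` captured by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_predictors):
    """R-squared with a penalty that grows with the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Made-up true values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
predicted = [3.2, 4.7, 7.1, 9.3, 10.6, 13.1]

r2 = r_squared(actual, predicted)
adj_small = adjusted_r_squared(r2, len(actual), n_predictors=1)
adj_large = adjusted_r_squared(r2, len(actual), n_predictors=3)

print(f"R-squared:              {r2:.4f}")
print(f"adjusted, 1 predictor:  {adj_small:.4f}")
print(f"adjusted, 3 predictors: {adj_large:.4f}")
```

For the same fit, the adjusted value drops as more predictors are used, which is the penalty for model complexity described above.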
■ Predictor variables have a coefficient:
For a linear model this is easy to interpret. In our example if you add
“1” to x1, it will change y by “0.7658”.
Coefficients are great, but sometimes not enough evidence exists to
show that the influence is there.
■ Predictor significance :
If, for instance, you determine that a certain
gene is significant as a cause for cancer, this is
important knowledge.
Detecting influences is more important in scientific
studies than perfectly fitting models.
But when do we know a gene has that impact?
This is called significance.
This is what the p-value is about.
k-nearest neighbors.
Linear regression works if you want to predict a value, but what if you want to classify something? Then
you turn to classification models, the best known among them being k-nearest neighbors.
k-nearest neighbors looks at labeled points nearby an unlabeled point and,
based on this, makes a prediction of what the label should be.
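The idea can be sketched from scratch in a few lines. The points, labels, the helper name `knn_predict`, and the choice of k = 3 with Euclidean distance are assumptions for illustration; this is not the library implementation used elsewhere in these notes.

```python
import math
from collections import Counter

# Made-up labeled points: ((x, y), label).
labeled = [
    ((1.0, 1.0), "red"),
    ((1.5, 2.0), "red"),
    ((2.0, 1.0), "red"),
    ((6.0, 6.0), "blue"),
    ((7.0, 5.5), "blue"),
    ((6.5, 7.0), "blue"),
]

def knn_predict(point, labeled_points, k=3):
    """Predict a label by majority vote among the k closest labeled points."""
    nearest = sorted(labeled_points, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((1.2, 1.5), labeled))
print(knn_predict((6.2, 6.1), labeled))
```

An odd k is a common choice with two classes because it avoids tied votes.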
[Link]() : returns the model accuracy, but by “scoring a model” we often mean
applying it on data to make a prediction.
prediction = [Link](predictors)
Now we can use the prediction and compare it to the real thing using a confusion
matrix.
metrics.confusion_matrix(target,prediction)
Confusion matrix :
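What a confusion matrix contains can be shown by counting (actual, predicted) pairs by hand. The `target` and `prediction` lists are made up, and the layout used here (rows for actual classes, columns for predicted classes) is an assumption of this sketch.

```python
from collections import Counter

# Made-up true labels and model predictions.
target = [1, 0, 1, 1, 0, 0, 1, 0]
prediction = [1, 0, 0, 1, 0, 1, 1, 0]

# Count how often each (actual, predicted) combination occurs.
pairs = Counter(zip(target, prediction))
labels = sorted(set(target) | set(prediction))

# Rows are actual classes, columns are predicted classes.
matrix = [[pairs[(actual, predicted)] for predicted in labels] for actual in labels]
for actual, row in zip(labels, matrix):
    print(f"actual {actual}: {row}")

# Correct predictions sit on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(target)
print("accuracy:", accuracy)
```

Everything off the diagonal is a misclassification, so the matrix shows not only how often the model is wrong but also which classes it confuses.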
It’s fairly easy to use models that are available in R within Python with the help of the RPy
library.
RPy provides an interface from Python to R.
R is a free software environment, widely used for statistical computing.
We will be building multiple models from which we can choose the best one based on
multiple criteria. Working with a holdout sample helps in picking the best-performing model.
A holdout sample is a part of the data you leave out of the model building so it can be
used to evaluate the model afterward.
o The principle here is simple: the model should work on unseen data.
o You use only a fraction of your data to estimate the model and the other part, the holdout
sample, is kept out of the equation.
o The model is then unleashed on the unseen data and error measures are calculated to evaluate it.
o The error measure used in the example is the mean square error.
o Mean square error is a simple measure: check for every prediction how far it
was from the truth, square this error, and add up the error of every prediction.
We use 800 randomly chosen observations out of 1,000 (or 80%), without
showing the other 20% of data to the model.
Once the model is trained, we predict the target values for the other 20% of the
observations, for which we already know the true values, and calculate
the model error with an error measure.
Then we choose the model with the lowest error.
In this example we chose model 1 because it has the lowest total error.
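The holdout procedure can be sketched end to end on synthetic data. The data-generating line (y = 3x + 2 plus noise), the two candidate models, and the random seed are all assumptions for this example. The error is averaged over the holdout rows rather than summed; on the same holdout sample this does not change which model wins.

```python
import random

random.seed(42)  # fixed seed so the split is reproducible

# Synthetic data: 1,000 observations from an assumed linear relationship.
xs = [random.uniform(0, 10) for _ in range(1000)]
data = [(x, 3.0 * x + 2.0 + random.gauss(0, 1)) for x in xs]

# Random 80/20 split: 800 rows to train on, 200 held out.
random.shuffle(data)
train, holdout = data[:800], data[800:]

def fit_line(points):
    """Least-squares slope and intercept for a single predictor."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in points) / sum(
        (x - mean_x) ** 2 for x, _ in points
    )
    return slope, mean_y - slope * mean_x

def mean_squared_error(model, points):
    """Average squared distance between predictions and true values."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

# Two candidate models, both estimated on the training rows only.
train_mean = sum(y for _, y in train) / len(train)
slope, intercept = fit_line(train)
models = {
    "mean only": lambda x: train_mean,
    "linear": lambda x: slope * x + intercept,
}

# Evaluate on unseen data and keep the model with the lowest error.
errors = {name: mean_squared_error(model, holdout) for name, model in models.items()}
best = min(errors, key=errors.get)
print(errors)
print("best model:", best)
```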
Many models make strong assumptions, such as independence of the inputs, and
we have to verify that these assumptions are indeed met.
This is called model diagnostics.
Once we have a working model we are ready to go to the last step.
After the data is analyzed and a well-performing model is built, the findings are
ready to be presented to the world.
o Sometimes we need to repeat our work over and over again because of the predictions of
the models or the insights that we produced.
o For this reason, it is necessary to automate our models.
o This doesn’t always mean that you have to redo all of your analysis all the time.
o Sometimes it’s sufficient to implement only the model scoring; other times you can build an
application that automatically updates reports, Excel spreadsheets, or PowerPoint
presentations.
o Our soft skills will be most useful at this stage.