FDSA Unit1

The document provides an introduction to data science, outlining its need, benefits, and the data science process. It discusses the characteristics and challenges of big data, differentiates between data scientists and statisticians, and highlights the importance of Python in data science. Additionally, it details the various types of data, the steps involved in the data science process, and emphasizes the significance of setting clear research goals and retrieving relevant data.

Uploaded by

gshinymol94
Copyright
© © All Rights Reserved

UNIT I

INTRODUCTION TO DATA SCIENCE


Need for data science – benefits and uses – facets of data – data science process – setting the research
goal – retrieving data – cleansing, integrating, and transforming data – exploratory data analysis – build
the models – presenting and building applications.

1.1 NEED FOR DATA SCIENCE


Big Data : It is a blanket term for any collection of data sets so large or complex that it becomes difficult to
process them using traditional data management techniques.
Data science involves using methods to analyze massive amounts of data and extract the knowledge it
contains.

Characteristics of big data

Fig. 5 Vs of Big Data

These characteristics are often referred to as the five Vs:

o Volume: How much data is there?
o Variety: How diverse are the different types of data?
o Velocity: At what speed is new data generated?
o Veracity: How accurate is the data?
o Value: What value does the data provide?
These five properties make big data different from the data found in traditional data
management tools.

Challenges of big data:


• data capture
• curation
• storage
• search
• sharing
• transfer
• visualization
• the need for specialized techniques to extract insights

Role of Data Science


• It is an evolutionary extension of statistics capable of dealing with the massive amounts of
data produced today.
• It adds methods from computer science to the repertoire of statistics.

Data Scientist vs. Statistician
• The main things that set a data scientist apart from a statistician are the ability to work with
big data and experience in machine learning, computing, and algorithm building.
• Their tools tend to differ too, with data scientist job descriptions more frequently mentioning
the ability to use Hadoop, Pig, Spark, R, Python, and Java, among others.

Why Python?
• Python is a great language for data science because it has many data science libraries
available, and it’s widely supported by specialized software.
• Every popular NoSQL database has a Python-specific API.
• It allows you to prototype quickly.
• It offers good performance.
• Python’s influence is steadily growing in the data science world.
As the amount of data continues to grow and the need to leverage it becomes more important,
every data scientist will come across big data projects throughout their career.

1.2 BENEFITS AND USES OF DATA SCIENCE

Data science and big data are used almost everywhere in both commercial and non-commercial
settings.
Here are some of the fields where Data Science and Big Data are widely used.

(i) Commercial companies


• Gain insights into their customers, processes, staff, competition, and products.
• Offer customers a better user experience, as well as cross-sell, up-sell, and
personalize their offerings.
• For example, Google AdSense collects data from internet users so relevant
commercial messages can be matched to the person browsing the internet.

(ii) Human resource professionals


These professionals use people analytics and text mining to :
 screen candidates
 monitor the mood of employees
 study informal networks among coworkers.

(iii) Financial institutions


• Predict stock markets.
• Determine the risk of lending money.
• Learn how to attract new clients for their services.
• 50% of trades worldwide are performed automatically by machines, based on algorithms
developed by quants.
• Quants are data scientists who work on trading algorithms with the help of big data and data
science techniques.

(iv) Governmental Organizations


 Rely on internal data scientists to discover valuable information, but also share their data
with the public.
 Use the data to gain insights or build data-driven applications.
 A data scientist gets to work on diverse projects such as detecting fraud and other
criminal activity or optimizing project funding.

(v) Non-governmental organizations
o Non-governmental organizations (NGOs) use data to raise money and defend their
causes.
o The World Wildlife Fund (WWF), for instance, employs data scientists to increase
the effectiveness of its fundraising efforts.
o Many data scientists devote part of their time to helping NGOs, because NGOs
often lack the resources to collect data and employ data scientists.
o DataKind is one such data scientist group that devotes its time to the benefit of
mankind.

(vi) Universities
• Universities use data science in their research but also to enhance the study experience of
their students.
• Massive open online courses (MOOCs) produce a lot of data, which allows universities to
study how this type of learning can complement traditional classes.
• Examples of MOOC providers: Coursera, Udacity, and edX.
• MOOCs allow you to stay up to date by following courses from top universities.

1.3. FACETS OF DATA


In data science and big data there are many different types of data, and each of them tends to
require different tools and techniques.
The main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming

1.3.1 Structured data


Structured data is data that depends on a data model and resides in a fixed field within a record.
o Structured data is easy to store in tables within databases or Excel files.
o SQL, or Structured Query Language, is the preferred way to manage and query data that
resides in databases.
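
As a minimal sketch of how structured data is stored and queried with SQL, the snippet below uses Python's built-in sqlite3 module with an in-memory database. The table name `people`, its columns, and its rows are made up for illustration.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# A fixed schema is what makes the data "structured":
# every record has the same named fields.
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Asha", 34, "IN"), ("Ben", 27, "US"), ("Chloe", 41, "FR")],
)

# Query by field: names of people older than 30, in alphabetical order.
rows = conn.execute(
    "SELECT name FROM people WHERE age > 30 ORDER BY name"
).fetchall()
names_over_30 = [name for (name,) in rows]
print(names_over_30)

conn.close()
```

Because every record follows the same schema, the query can refer to fields by name without inspecting individual records first.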

Figure 1.1 An Excel table is an example of structured data.

1.3.2 Unstructured data


The world isn’t made up of structured data, though; structure is imposed upon it by humans and
machines. More often, data comes unstructured. Example: email (figure 1.2).
• Unstructured data is data that isn’t easy to fit into a data model because the content is
context-specific or varying.
• Although email contains structured elements such as the sender, title, and body text, analyzing
its content is a challenge, and the thousands of different languages and dialects out there
complicate this further.

Figure 1.2 Email is an example of unstructured data and natural language data.

1.3.3 Natural language


Natural language is a special type of unstructured data; it’s challenging to process because
it requires knowledge of specific data science techniques and linguistics.
• The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis.
• But models trained in one domain don’t generalize well to other domains.
• Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of text.
• Humans struggle with natural language as well. It’s ambiguous by nature.
• The concept of meaning itself is questionable here.
• The meaning of the same words can vary when coming from someone upset or joyous.

1.3.4 Machine-generated data
Machine-generated data is information that’s automatically created by a computer,
process, application, or other machine without human intervention.
• Machine-generated data is becoming a major data resource and will continue to do so. Wikibon
has forecast that the market value of the industrial Internet (a term coined by
• The analysis of machine data relies on highly scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call detail records, network event logs, and
telemetry (figure 1.3).
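
To illustrate what working with machine-generated data can look like, here is a small sketch that turns one web server log line into a structured record. The log line and its layout are hypothetical (a simplified common log style); real servers may use a different format.

```python
import re

# Hypothetical log line; the layout is an assumption for this example.
log_line = (
    '192.168.1.10 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326'
)

# Named groups pick out the fields we care about.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)

match = pattern.match(log_line)
if match is None:
    raise ValueError("unexpected log format")

record = match.groupdict()
# Convert numeric fields from text to integers.
record["status"] = int(record["status"])
record["size"] = int(record["size"])
print(record)
```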

Figure 1.3 Example of machine-generated data

The machine data shown in figure 1.3 would fit nicely in a classic table-structured database. This isn’t the best approach for highly
interconnected or “networked” data, where the relationships between entities have a valuable role to play.

1.3.5 Graph-based or network data


“Graph” points to mathematical graph theory. In graph theory, a graph is a mathematical
structure to model pair-wise relationships between objects.
 Graph or network data is, in short, data that focuses on the relationship or adjacency of objects.
 The graph structures use nodes, edges, and properties to represent and store graphical data.

 Graph-based data is a natural way to represent social networks, and its structure allows you to
calculate specific metrics such as the influence of a person and the shortest path between two
people.
 Examples of graph-based data can be found on many social media websites (figure 1.4).

Figure 1.4 Friends in a social network are an example of graph-based data.


Your follower list on Twitter is another example of graph-based data.
• Graph databases are used to store graph-based data and are queried with specialized query
languages such as SPARQL.
• Graph data poses its challenges, but for a computer interpreting audio and image data, it can
be even more difficult.
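
The idea of computing the shortest path between two people in a social network can be sketched with a breadth-first search over an adjacency list. The network and the names in it are invented for illustration, and counting direct connections is only a very crude stand-in for "influence".

```python
from collections import deque

# Hypothetical friendship network (undirected), stored as an adjacency list.
friends = {
    "Ann": ["Bob", "Cid"],
    "Bob": ["Ann", "Dee"],
    "Cid": ["Ann", "Dee"],
    "Dee": ["Bob", "Cid", "Eve"],
    "Eve": ["Dee"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: return one shortest path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

path = shortest_path(friends, "Ann", "Eve")

# Degree = number of direct connections per person.
degree = {person: len(links) for person, links in friends.items()}
most_connected = max(degree, key=degree.get)
print(path, most_connected)
```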

1.3.6 Audio, image, and video


Audio, image, and video are data types that pose specific challenges to a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging
for computers.
• High-speed cameras at stadiums capture ball and athlete movements so that statistics can be calculated in real
time.
• Recently a company called DeepMind succeeded at creating an algorithm that’s capable of
learning how to play video games.
• This algorithm takes the video screen as input and learns to interpret everything via a complex
process of deep learning.
• Google bought the company for its artificial intelligence (AI) development plans. The learning
algorithm takes in data as it’s produced by the computer game; it’s streaming data.

1.3.7 Streaming data


Streaming data flows into the system when an event happens instead of being loaded into a
data store in a batch.
 Examples are the “What’s trending” on Twitter, live sporting or music events, and the stock
market.
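
The difference between streaming and batch processing can be sketched with Python generators: each value is processed as it arrives, and the result is updated per event rather than after loading everything. The price feed here is a stand-in with made-up values, not a real data source.

```python
def price_stream():
    """Stand-in for a live feed; the prices are made-up values."""
    for price in [100.0, 102.0, 101.0, 105.0]:
        yield price

def running_mean(stream):
    """Yield the mean of all values seen so far, updated on every event."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count

# Only the running count and total are kept in memory, not the full history.
means = list(running_mean(price_stream()))
print(means)
```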

1.4 THE DATA SCIENCE PROCESS


The data science process typically consists of six steps.

(1) Setting the research goal
Data science is mostly applied in the context of an organization. When
the business asks you to perform a data science project, you’ll first prepare a project
charter.
 Charter contains information such as :
 what you’re going to research
 how the company benefits from that
 what data and resources you need
 timetable
 deliverables

(2) Data Retrieval

In this step the data is collected.


• The charter states which data you need and where you can find it.
• In this step you make sure that you can use the data in your program, which means checking the
existence, quality, and accessibility of the data.
• Data can also be delivered by third-party companies and takes many forms, ranging from
Excel spreadsheets to different types of databases.

(3) Data preparation


Data collection is an error-prone process; in this phase you enhance the quality of
the data and prepare it for use in subsequent steps.
It consists of three subphases:
1. data cleansing: removes false values from a data source and inconsistencies across
data sources,
2. data integration: enriches data sources by combining information from multiple
data sources
3. data transformation: ensures that the data is in a suitable format for use in your
models.

(4) Data exploration


Data exploration is concerned with building a deeper understanding of your data.
We can try to understand how variables interact with each other, the distribution
of the data, and whether there are outliers.
To achieve this you mainly use:
 descriptive statistics
 visual techniques
 simple modeling
This step is called Exploratory Data Analysis (EDA).

(5) Data modeling or model building


In this phase models, domain knowledge, and insights about the data found
in the previous steps are used to answer the research question.
It has the following steps:
 Select a technique from the fields of statistics, machine learning, operations
research, and so on.
 Building a model is an iterative process that involves selecting the variables for the
model, executing the model, and model diagnostics.
(6) Presentation and automation
Finally, you present the results to your business. These results can take many
forms, ranging from presentations to research reports.
Sometimes automation of the execution of the process is needed because the
business will want to use the insights you gained in another project or enable an
operational process to use the outcome from your model.

An Iterative Process
o the data science process is not always linear
o in reality we often have to step back and rework certain findings
o we might gain incremental insights, which may lead to new questions.
o To prevent rework, make sure that you scope the business question
clearly and thoroughly at the start.

Fig. Overview of Data Science process

1.5 SETTING THE RESEARCH GOAL (Step 1)
A project starts by understanding the what, the why, and the how of the project.

• Answering these questions (what, why, how) is the goal of the first phase, so that everybody
knows what to do and can agree on the best course of action.
• The outcome should be:
o a clear research goal
o a good understanding of the context
o well-defined deliverables
o a plan of action with a timetable
• This information is then best placed in a project charter.
• The length and formality can, of course, differ between projects and companies.
• In this early phase of the project, people skills and business acumen are more
important than great technical prowess.
• This phase is often guided by more senior personnel.
1.5.1 Spend time understanding the goals and context of your research
An essential outcome is the research goal, as it states the purpose of your
assignment in a clear and focused manner.
• Understanding the business goals and context is critical for project success.
• Continue asking questions and devising examples until you grasp the exact business
expectations.
• Identify how your project fits in the bigger picture, and appreciate how your research is
going to change the business.
• Understand how the business will use your results.
• Many data scientists fail here: despite their mathematical wit and scientific brilliance, they
never seem to grasp the business goals and context.

1.5.2 Create a project charter


Clients like to know upfront what they’re paying for.
o After you have a good understanding of the business problem, try to get a formal
agreement on the deliverables.
o All this information is best collected in a project charter.
o For any significant project this would be mandatory.
o A project charter requires teamwork, and your input covers at least the following:
 A clear research goal
 The project mission and context
 How you’re going to perform your analysis
 What resources you expect to use
 Proof that it’s an achievable project, or proof of concepts
 Deliverables and a measure of success
 A timeline
The client can use this information to make an estimation of the project costs and the
data and people required for the project to become a success.

1.6 RETRIEVING DATA (Step 2)


The second step in data science is to retrieve the required data (figure 2.3).
Many companies will have already collected and stored the data, and what they don’t
have can often be bought from third parties.
It is also worth looking outside your organization for data, because more and more organizations
are making even high-quality data freely available for public and commercial use.

 Data can be stored in many forms, ranging from simple text files to tables in a database.
 The objective now is acquiring all the data you need.
 Data is often like a diamond in the rough: it needs polishing to be of any use to you.

1.6.1 Start with data stored within the company


• Your first act should be to assess the relevance and quality of the data that’s readily available within
your company.
• Most companies have a program for maintaining key data, so much of the cleaning work may
already be done.
• This data can be stored in official data repositories that are maintained by a team of IT
professionals, such as:
o Databases: designed for data storage.
o Data marts: subsets of the data warehouse, geared toward serving a specific business
unit.
o Data warehouses: designed for reading and analyzing that data.
o Data lakes: contain data in its natural or raw format.
While data warehouses and data marts are home to preprocessed data, data lakes hold data that has not yet been processed.
Challenges in finding data in our own company:
• As companies grow, their data becomes scattered around many places.
• Knowledge of the data may be dispersed as people change positions and leave the company.
• Documentation and metadata aren’t always the top priority of a delivery manager.
Chinese Walls:
 Organizations understand the value and sensitivity of data and often have policies in place so
everyone has access to what they need and nothing more.
 These policies translate into physical and digital barriers called Chinese walls.
 These “walls” are mandatory and well-regulated for customer data in most countries.
 This is for good reasons, too; imagine everybody in a credit card company having access to
your spending habits.
 Getting access to the data may take time and involve company politics.

1.6.2 Searching data outside organization


Many companies specialize in collecting valuable information. For example, Nielsen and GFK are well
known for this in the retail industry.
o Other companies provide data so that you, in turn, can enrich their services and ecosystem.
Such is the case with Twitter, LinkedIn, and Facebook.
o Governments and organizations share their data for free with the world.
o This data can be of excellent quality; it depends on the institution that creates and manages it.
o For example, the information they share can be on the number of accidents or the amount of drug abuse and its
demographics.
This data is helpful when you want to enrich proprietary data but also convenient when training
your data science skills at home.

Open data site    Description
[Link]            The home of the US Government’s open data
[Link]            An open database that retrieves its information from sites
                  like Wikipedia, MusicBrainz, and the SEC archive
[Link]            Open data initiative from the World Bank
[Link]            Open data for international development
[Link]            Open data from the US Food and Drug Administration

Table 2.1 A list of open-data providers that should get you started

Data Quality Checks


o Most of the errors you’ll encounter during the data-gathering phase are easy to spot, but
being too careless will make you spend many hours solving data issues that could have been prevented
during data import.
o Investigate the data during the import, data preparation, and exploratory phases.
o During data retrieval, you check to see if the data is equal to the data in the source document
and look to see if you have the right data types.
 When you have enough evidence that the data is similar to the data you find in the source
document, you stop.
o During data preparation, you do a more elaborate check.
 If you did a good job during the previous phase, the errors you find now are also
present in the source document.
 The focus is on the content of the variables: you want to get rid of typos and other data entry
errors and bring the data to a common standard among the data sets.
o During the exploratory phase your focus shifts to what you can learn from the data.
 Assume the data to be clean and look at the statistical properties such as distributions,
correlations, and outliers.
o You will often iterate over these phases.

1.7 CLEANSING, INTEGRATING, AND TRANSFORMING DATA (Step 3)
• The data received from the data retrieval phase is likely to be “a diamond in the rough.”
• The task now is to sanitize and prepare it for use in the modeling and reporting phase.
• Doing so is tremendously important because models will perform better and you’ll lose less
time trying to fix strange output.
• Your model needs the data in a specific format, so data transformation will always come into
play.
• It’s a good habit to correct data errors as early in the process as possible. This is not
always possible, but the effort should still be made.
1.7.1 Cleansing data
Data cleansing is a subprocess of the data science process that focuses on removing errors in
your data so your data becomes a true and consistent representation of the processes it originates from.
By “true and consistent representation” we imply that at least two types of errors exist. They are:
1. Interpretation errors:
These occur when you take a value in your data for granted, such as accepting that a person’s age is
greater than 300 years.
2. Inconsistencies between data sources or against your company’s standardized values:
An example of this class of errors is putting “Female” in one table and “F” in another when they
represent the same thing: that the person is female.
More advanced methods for finding data errors exist, but at the data cleansing stage they are rarely
applied and are often regarded by certain data scientists as overkill.

ERRORS
I. DATA ENTRY ERRORS
Data collection and data entry are error-prone processes.
a) Human intervention
Whenever humans are involved, they can make typos or lose their concentration for a second and introduce an
error into the chain.
b) Machine/Computers
• Data collected by machines or computers isn’t free from errors either.
• Some errors arise from human sloppiness, whereas others are due to machine or
hardware failure.
• Examples of errors originating from machines are transmission errors or bugs in the
extract, transform, and load (ETL) phase.
• For small data sets you can check every value by hand.

 Detecting data errors when the variables you study don’t have many classes can be
done by tabulating the data with counts.
 When you have a variable that can take only two values: “Good” and “Bad”, we
can create a frequency table and see if those are truly the only two values present.
 In table 2.3, the values “Godo” and “Bade” point out something went wrong in at
least 16 cases.
Table 2.3 Detecting outliers on simple variables with a frequency table
Value    Count
Good     1598647
Bad      1354468
Godo     15
Bade     1
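
A frequency table like this can be sketched in Python with collections.Counter. The sample values below are made up; the variable is assumed to allow only "Good" and "Bad".

```python
from collections import Counter

# Made-up sample of a variable that should only contain "Good" or "Bad".
values = ["Good", "Bad", "Good", "Godo", "Bad", "Good", "Bade"]
allowed = {"Good", "Bad"}

# Tabulate the data with counts.
counts = Counter(values)

# Any value outside the allowed set points to a data entry error.
unexpected = {value: n for value, n in counts.items() if value not in allowed}

for value, n in counts.most_common():
    print(f"{value:<6}{n}")
print("Unexpected values:", unexpected)
```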
II. REDUNDANT WHITESPACE
• Whitespaces tend to be hard to detect but cause errors like other redundant characters
would.
• For example, whitespace at the end of a string can cause a bug that is difficult to find.
• After looking for days through the code, you finally find the bug.
• Then comes the hardest part: explaining the delay to the project stakeholders.
• In such a case, the cleaning during the ETL (extract, transform, and load) phase wasn’t well executed, and
keys in one table contained a whitespace at the end of a string.
• This caused a mismatch of keys such as “FR ” – “FR”, dropping the observations that
couldn’t be matched.
• If you know to watch out for them, fixing redundant whitespaces is luckily easy enough in
most programming languages.
• They all provide string functions that will remove the leading and trailing whitespaces.
• For instance, in Python you can use the strip() function to remove leading and trailing spaces.

Fixing Capital Letter Mismatches


Capital letter mismatches are common.
o Most programming languages make a distinction between “Brazil” and “brazil”.
In this case you can solve the problem by applying a function that returns both strings in
lowercase, such as .lower() in Python: "Brazil".lower() == "brazil".lower() should evaluate to True.
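
The whitespace and capitalization fixes can be combined into one small key-normalizing helper. This is a sketch with made-up keys; `normalize_key` is a hypothetical helper name, not a library function.

```python
def normalize_key(key):
    """Remove leading/trailing whitespace and ignore capitalization."""
    return key.strip().lower()

# Lookup table keyed by clean, lowercase codes (made-up data).
countries = {"fr": "France", "br": "Brazil"}

# Keys as they might arrive from another table.
raw_keys = ["FR ", " br", "Fr"]

# Without cleaning, none of the raw keys match.
matched_raw = [key for key in raw_keys if key in countries]

# After normalizing, every key matches.
matched_clean = [
    countries[normalize_key(key)]
    for key in raw_keys
    if normalize_key(key) in countries
]
print(matched_raw, matched_clean)
```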

III. IMPOSSIBLE VALUES AND SANITY CHECKS


Check the value against physically or theoretically impossible values such as people
taller than 3 meters or someone with an age of 299 years. Sanity checks can be directly expressed
with rules:
check = 0 <= age <= 120
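
Applied to a whole column, the same rule can flag impossible values. The ages below are made up, and the 0 to 120 range is the one used in the rule above.

```python
def is_plausible_age(age):
    """Sanity check: an age must lie between 0 and 120, inclusive."""
    return 0 <= age <= 120

ages = [25, 299, 47, -3, 120]  # made-up observations

# Collect every value that fails the sanity check.
flagged = [age for age in ages if not is_plausible_age(age)]
print(flagged)
```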

IV. OUTLIERS
An outlier is an observation that seems to be distant from other observations or,
more specifically, one observation that follows a different logic or generative process
than the other observations.
 Outliers may be exceptions that stand outside individual samples of populations.
 An outlier is a data point that is noticeably different from the rest.

 The easiest way to find outliers is to use a plot or a table with the minimum and
maximum values. An example is shown in figure 2.6.
The normal distribution, or Gaussian distribution, is the most common
distribution in natural sciences.

Which technique to use at what time is dependent on your particular case.
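
As one possible sketch, the snippet below first looks at the minimum and maximum, then applies the 1.5 × IQR rule. That rule is a common rule of thumb and is an assumption of this example, not a method named in the text; the data is made up. It uses statistics.quantiles, available in Python 3.8 and later.

```python
import statistics

# Made-up measurements with one suspicious value.
data = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]

# Step 1: a simple min/max check already hints at a problem.
lowest, highest = min(data), max(data)

# Step 2: rule of thumb: flag points more than 1.5 * IQR
# below the first quartile or above the third quartile.
q1, _median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(lowest, highest, outliers)
```

Whether a flagged point is an error or a genuine extreme value still has to be judged case by case.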

V. Deviations from a code book


 Detecting errors in larger data sets against a code book or against standardized values
can be done with the help of set operations.
 A code book is a description of your data, a form of metadata.
o It contains things such as the number of variables per observation, the
number of observations, and what each encoding within a variable means.

o A code book also tells the type of data you’re looking at: is it hierarchical,
graph, something else?
 If you have multiple values to check, it’s better to put them from the code book into a
table and use a difference operator to check the discrepancy between both tables.
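
The difference operator can be sketched with Python sets. The code book entries and the observed column are made up for illustration.

```python
# Allowed encodings according to a hypothetical code book.
codebook_values = {"F", "M"}

# Values actually observed in the data (made-up column).
observed_column = ["F", "M", "Female", "M", "f"]
observed_values = set(observed_column)

# Set difference: values in the data that the code book does not define.
not_in_codebook = observed_values - codebook_values

# The reverse difference: codes that never occur in the data.
unused_codes = codebook_values - observed_values
print(not_in_codebook, unused_codes)
```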

VI. Different units of measurement


When integrating two data sets, you have to pay attention to their respective units of
measurement.
An example of this would be when you study the prices of gasoline in the world. To do this
you gather data from different data providers. Data sets can contain prices per gallon and others
can contain prices per liter. A simple conversion will do the trick in this case.
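
A minimal sketch of such a conversion follows. It assumes the gallon in question is the US liquid gallon (the imperial gallon is larger); the providers and prices are made up.

```python
# Assumption: US liquid gallon, defined as exactly 3.785411784 liters.
LITERS_PER_US_GALLON = 3.785411784

# Made-up prices from two hypothetical providers using different units.
prices = [
    {"provider": "A", "price": 3.60, "unit": "gallon"},
    {"provider": "B", "price": 1.10, "unit": "liter"},
]

def price_per_liter(row):
    """Bring a price to the common unit: price per liter."""
    if row["unit"] == "liter":
        return row["price"]
    if row["unit"] == "gallon":
        return row["price"] / LITERS_PER_US_GALLON
    raise ValueError(f"unknown unit: {row['unit']}")

standardized = [round(price_per_liter(row), 3) for row in prices]
print(standardized)
```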

VII. Different levels of aggregation


 Having different levels of aggregation is similar to having different types of
measurement.
 An example of this would be a data set containing data per week versus one containing
data per work week.
o This type of error is generally easy to detect, and summarizing the data sets will fix it.
After cleaning the data errors, you combine information from different data sources.

1.7.2 Correct errors as early as possible


A good practice is to remedy data errors as early as possible in the data collection
chain and to fix as little as possible inside your program, fixing the origin of the problem instead.
• The data collection process is error-prone, and in a big organization it involves many
steps and teams.
• Data should be cleansed when acquired for many reasons:
■ Not everyone spots the data anomalies.
■ If errors are not corrected early on in the process, the cleansing will have to be done for
every project that uses that data.
■ Data errors may point to:
o a business process that isn’t working as designed
o defective equipment, such as broken transmission lines and defective sensors
o bugs in software or in the integration of software that may be critical to the
company
• As a final remark: always keep a copy of your original data.

1.7.3 Combining data from different data sources


Data comes from several different places, and in this substep we focus on integrating these
different sources.
Data varies in size, type, and structure, ranging from databases and Excel files to text
documents.
The Different Ways Of Combining Data
There are two operations to combine information from different data sets.
1. Joining: enriching an observation from one table with information from another
table.
2. Appending or Stacking: adding the observations of one table to those of another table.

• When you combine data, you have the option to create a new physical table or
a virtual table by creating a view.
• The advantage of a view is that it doesn’t consume more disk space.

 Joining tables
Joining tables allows you to combine the information of one observation found in one table
with the information that you find in another table.

Example: To join tables, you use variables that represent the same object in both tables,
such as a date, a country name, or a Social Security number.
These common fields are known as keys.
When these keys also uniquely define the records in the table they are called primary
keys.
 Appending Tables
Appending or stacking tables is effectively adding observations from one
table to another table.
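
The two operations can be sketched with plain Python lists and dictionaries. All table contents and column names here are made up; in practice a database or a data analysis library would typically do this work.

```python
# Joining: enrich each observation with information from another table.
clients = [
    {"client_id": 1, "name": "Ann", "country": "FR"},
    {"client_id": 2, "name": "Bob", "country": "BR"},
]
# Second table, indexed by the key ("country") shared by both tables.
country_names = {"FR": "France", "BR": "Brazil"}

joined = [
    {**row, "country_name": country_names.get(row["country"])}
    for row in clients
]

# Appending (stacking): add the observations of one table to another.
# Both tables must have the same columns.
january = [{"client_id": 1, "amount": 50}]
february = [{"client_id": 2, "amount": 75}]
stacked = january + february

print(joined)
print(stacked)
```

Joining adds columns and keeps the number of rows; appending adds rows and keeps the columns.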

USING VIEWS TO SIMULATE DATA JOINS AND APPENDS


To avoid duplication of data, you virtually combine data with views.
A view behaves as if you’re working on a table, but this table is nothing but a virtual layer
that combines the tables for you.
Drawback:
While a table join is only performed once, the join that creates the view is recreated every
time it’s queried, using more processing power than a pre-calculated table would have.
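
A view that simulates an append can be sketched with Python's built-in sqlite3 module. The table and view names (`sales_jan`, `sales_feb`, `sales_all`) and the rows are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_jan (client TEXT, amount REAL);
CREATE TABLE sales_feb (client TEXT, amount REAL);
INSERT INTO sales_jan VALUES ('Ann', 50), ('Bob', 20);
INSERT INTO sales_feb VALUES ('Ann', 30);

-- The view stores only the query, not a copy of the rows.
CREATE VIEW sales_all AS
    SELECT client, amount FROM sales_jan
    UNION ALL
    SELECT client, amount FROM sales_feb;
""")

rows_before = conn.execute("SELECT COUNT(*) FROM sales_all").fetchone()[0]

# New data in an underlying table shows up in the view automatically,
# because the view's query runs again each time the view is read.
conn.execute("INSERT INTO sales_feb VALUES ('Cid', 10)")
rows_after = conn.execute("SELECT COUNT(*) FROM sales_all").fetchone()[0]

print(rows_before, rows_after)
conn.close()
```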

ENRICHING AGGREGATED MEASURES
Data enrichment can also be done by adding calculated information to the table.
Extra measures such as these can add perspective.

Transforming Data
Certain models require their data to be in a certain shape.
o Now that the data has been cleansed and integrated, this is the next task we can
perform: transforming the data so it takes a suitable form for data modeling.
o Relationships between an input variable and an output variable aren’t always linear.
o Take, for instance, a relationship of the form $y = a e^{bx}$.
o Taking the log of both sides gives $\ln y = \ln a + bx$, which is linear in $x$ and simplifies the estimation problem dramatically.
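
To see why a log transformation helps, note that for $y = a e^{bx}$, taking the natural logarithm of both sides gives $\ln y = \ln a + bx$, a straight line in $x$. The sketch below generates noise-free data from made-up values of a and b and recovers them with an ordinary least-squares line fitted to $(x, \ln y)$.

```python
import math

# Made-up parameters, used only to generate example data.
a_true, b_true = 2.0, 0.5
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [a_true * math.exp(b_true * x) for x in xs]

# Transform: ln(y) = ln(a) + b * x is linear in x.
log_ys = [math.log(y) for y in ys]

# Ordinary least squares for a straight line through (x, ln y).
n = len(xs)
mean_x = sum(xs) / n
mean_log_y = sum(log_ys) / n
slope = sum(
    (x - mean_x) * (ly - mean_log_y) for x, ly in zip(xs, log_ys)
) / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_log_y - slope * mean_x

# Map the line's coefficients back to the original parameters.
b_est = slope
a_est = math.exp(intercept)
print(a_est, b_est)
```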

REDUCING THE NUMBER OF VARIABLES


Having too many variables in your model makes the model difficult to handle, and certain
techniques don’t perform well when you overload them with too many input variables.
• Data scientists use special methods to reduce the number of variables but retain
the maximum amount of information.
• Reducing the number of variables makes it easier to understand the key values.

TURNING VARIABLES INTO DUMMIES


- Variables can be turned into dummy variables.
- Dummy variables can only take two values: true(1) or false(0).
- They’re used to indicate the presence or absence of a categorical effect that may explain the
observation.
- In this case you’ll make separate columns for the classes stored in one variable and
indicate it with 1 if the class is present and 0 otherwise.
- An example is turning one column named Weekdays into the columns Monday through
Sunday.
- You use an indicator to show if the observation was on a Monday; you put 1 on Monday
and 0 elsewhere.
- Turning variables into dummies is a technique that’s used in modeling and is popular
with, but not exclusive to, economists.
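
The Weekdays example can be sketched in plain Python as follows; the observations are made up.

```python
weekdays = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Made-up observations of the single "Weekdays" variable.
observations = ["Monday", "Wednesday", "Monday", "Sunday"]

# One column per class: 1 if the class is present, 0 otherwise.
dummies = [
    {day: int(observation == day) for day in weekdays}
    for observation in observations
]

for row in dummies:
    print(row)
```

Each row contains exactly one 1, in the column of the day on which the observation was made.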

1.8. EXPLORATORY DATA ANALYSIS (Step 4)

o During exploratory data analysis you take a deep dive into the data.
o Information becomes much easier to grasp when shown in a picture, therefore you mainly use
graphical techniques to gain an understanding of your data and the interactions between variables.
o Some anomalies may have been missed in earlier phases, forcing you to take a step back and fix them.

17
o The visualization techniques you use in this phase range from simple line graphs or
histograms, complex diagrams such as Sankey and network graphs.
o Sometimes it’s useful to compose a composite graph from simple graphs to get even
more insight into the data.
o Other times the graphs can be animated or made interactive to make exploration easier.
Bar Chart:
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular
bars with heights or lengths proportional to the values that they represent.

Fig. Bar Chart


Distribution plot:
Distribution plots visually assess the distribution of sample data by comparing
the empirical distribution of the data with the theoretical values expected from a specified
distribution.

Fig. Distribution plot


Line plot:
A line plot, also called a dot plot, is a graph that shows the frequency, or the number of
times, a value occurs in a data set.

Fig. Line plot

Multiple plots:
These plots can be combined to provide even more insight.

Fig. Multiple plots


A Pareto diagram:
A Pareto chart is a type of chart that contains both bars and a line graph, where individual values are
represented in descending order by bars, and the cumulative total is represented by the line.

Fig. Pareto diagram
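
The numbers behind a Pareto chart can be computed as in the following sketch: the bars are the values sorted in descending order, and the line is the running total as a percentage. The category names and counts are made up for illustration.

```python
# Hypothetical complaint counts per category (made up for illustration).
counts = {"late delivery": 45, "damaged item": 30, "wrong item": 15, "other": 10}

# Bars: categories sorted by value, largest first.
bars = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Line: running total expressed as a percentage of the grand total.
total = sum(counts.values())
cumulative = []
running = 0
for _, value in bars:
    running += value
    cumulative.append(100 * running / total)

for (name, value), pct in zip(bars, cumulative):
    print(f"{name:15s} {value:3d} {pct:6.1f}%")
```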


Brushing and Linking:
With brushing and linking you combine and link different graphs and tables (or views) so
changes in one graph are automatically transferred to the other graphs.

Histogram:
In a histogram, a variable is cut into discrete categories and the number of occurrences in each category
is summed up.
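
A minimal sketch of the counting behind a histogram is shown below. The ages and the bin width of 10 are hypothetical choices; each value is assigned to a bin and the occurrences per bin are counted.

```python
from collections import Counter

# Hypothetical ages; the bin width of 10 is an arbitrary choice.
ages = [23, 27, 31, 35, 38, 41, 45, 52, 58, 64]
bin_width = 10

# Map each value to the lower edge of its bin, then count per bin.
bin_counts = Counter((age // bin_width) * bin_width for age in ages)

# Text rendering: one '#' per observation in the bin.
for lower in sorted(bin_counts):
    print(f"{lower}-{lower + bin_width - 1}: {'#' * bin_counts[lower]}")
```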

Boxplot:
The boxplot, on the other hand, doesn’t show how many observations are present but does
offer an impression of the distribution within categories. It can show the maximum, minimum,
median, and other characterizing measures at the same time.
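
The characterizing measures a boxplot displays can be computed with Python's standard statistics module, as in this sketch. The measurements are made up for illustration, and different tools may compute quartiles in slightly different ways.

```python
import statistics

# Hypothetical measurements (made up for illustration).
values = [12, 15, 17, 18, 21, 22, 24, 29, 31, 40]

minimum = min(values)
maximum = max(values)
median = statistics.median(values)

# Quartiles: the cut points that split the sorted data into four equal parts.
q1, _, q3 = statistics.quantiles(values, n=4)

print(minimum, q1, median, q3, maximum)
```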

Sankey Diagram:
A Sankey diagram is a visualization used to depict a flow from one set of values to another.
The things being connected are called nodes and the connections are called links.

Fig. Sankey Diagram

Network Graph:
Network diagrams (also called graphs) show interconnections between a set of entities.
Each entity is represented by a node (or vertex). Connections between nodes are represented
through links (or edges).

Fig. Network Graph


1.9. BUILD THE MODELS (Step 5)
With clean data in place and a good understanding of the content, we can start to build
models with the goal of making better predictions, classifying objects, or gaining an
understanding of the system that we are modeling.
This phase is much more focused than the exploratory analysis step, because we know
what we’re looking for and what the outcome should be.

o The techniques used are borrowed from the fields of machine learning, data mining,
and/or statistics.
o Building a model is an iterative process.
o The way you build your model depends on whether you go with classic statistics or
the somewhat more recent machine learning school, and the type of technique to be
used.
Either way, most models consist of the following main steps:
a. Model and variable selection
b. Model execution
c. Model diagnostics and model comparison

1.9.1 Model and variable selection


You need to select the variables to include in your model and a modeling technique.
 The findings from the exploratory analysis should already give a fair idea of what variables will
help you construct a good model.
 Many modeling techniques are available, and choosing the right model for a problem requires
judgment on your part.
Consider model performance and whether the project meets all the requirements to use the model, as
well as other factors:
■ Must the model be moved to a production environment and, if so, would it be easy to
implement?
■ How difficult is the maintenance on the model: how long will it remain relevant if left
untouched?
1.9.2 Model execution
Once the model is chosen, we need to implement it in code.
 Python has libraries such as StatsModels or Scikit-learn for doing this.
These packages implement several of the most popular techniques.
 Coding a model is a nontrivial task in most cases, so having these
libraries available can speed up the process.
The following listing shows the execution of a linear prediction model.

The summary() method of the fitted model outputs the results table.

■ Model fit :
 For this the R-squared or adjusted R-squared is used.
 This measure is an indication of the amount of variation in the
data that gets captured by the model.
 The difference between the adjusted R-squared and the R-squared
is minimal here; the adjusted one is the ordinary one with a
penalty for model complexity.

 A model gets complex when many variables (or features) are
introduced.
 You don’t need a complex model if a simple model is available, so
the adjusted R-squared punishes you for overcomplicating.
 For models in business, an R-squared above 0.85 is often considered
good, and values in the high 0.90s are very good.
 In research, however, very low model fits (even below 0.2) are often
found.
■ Predictor variables have a coefficient:
 For a linear model this is easy to interpret. In our example, if you add
1 to x1, y changes by 0.7658.
 Coefficients are great, but sometimes not enough evidence exists to
show that the influence is there.
■ Predictor significance :
 If, for instance, you determine that a certain
gene is significant as a cause for cancer, this is
important knowledge.
 Detecting influences is more important in scientific
studies than perfectly fitting models.
 But when do we know a gene has that impact?
 This is called significance.
 This is what the p-value is about.
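
To make the model fit measures above concrete, the following sketch fits a one-variable linear model by hand and computes the coefficient, the R-squared, and the adjusted R-squared. The data points are hypothetical, and the adjusted R-squared uses the common formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. Computing p-values is left out because it needs a statistics library.

```python
# Hypothetical data, roughly linear with some noise (made up for illustration).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0.9, 1.5, 2.6, 3.1, 4.2, 4.7]

n = len(x)
p = 1  # number of predictor variables

mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares coefficient and intercept for a single predictor.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
coef = sxy / sxx
intercept = mean_y - coef * mean_x

# R-squared: share of the variation in y captured by the model.
predicted = [intercept + coef * xi for xi in x]
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

# Adjusted R-squared: the same measure, penalized for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(coef, r_squared, adj_r_squared)
```

The coefficient reads as described above: adding 1 to x changes the predicted y by coef.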
k-nearest neighbors.
Linear regression works when we want to predict a value, but what if we want to classify something? Then
we turn to classification models, the best known among them being k-nearest neighbors.
k-nearest neighbors looks at labeled points nearby an unlabeled point and,
based on this, makes a prediction of what the label should be.

The classifier’s score() method returns the model accuracy, but by “scoring a model” we often mean
applying it to data to make a prediction. With knn denoting the fitted k-nearest neighbors classifier:

prediction = knn.predict(predictors)

Now we can use the prediction and compare it to the real thing using a confusion
matrix.

metrics.confusion_matrix(target,prediction)

Confusion matrix :

It is a table that is used to describe the performance of a classification algorithm.
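
The mechanics of k-nearest neighbors and of the confusion matrix can be sketched from scratch as follows. This is not scikit-learn's implementation: knn_predict is a hypothetical helper, the labeled points are made up, and k = 3 is an arbitrary choice.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest labeled points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical labeled points forming two well-separated groups.
predictors = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
              (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
target = ["a", "a", "a", "b", "b", "b"]

prediction = [knn_predict(predictors, target, point) for point in predictors]

# Confusion matrix: rows are true labels, columns are predicted labels.
labels = sorted(set(target))
confusion = [[0] * len(labels) for _ in labels]
for true_label, predicted_label in zip(target, prediction):
    confusion[labels.index(true_label)][labels.index(predicted_label)] += 1

# Accuracy: the correctly classified share, read off the diagonal.
accuracy = sum(confusion[i][i] for i in range(len(labels))) / len(target)
print(confusion, accuracy)
```

Off-diagonal cells count the misclassifications, so the matrix shows which classes get confused with each other, something a single accuracy number hides.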

 It’s fairly easy to use models that are available in R within Python with the help of the RPy
library.
 RPy provides an interface from Python to R.
 R is a free software environment, widely used for statistical computing.

1.9.3 Model diagnostics and model comparison

We will be building multiple models from which we can choose the best one based on
multiple criteria. Working with a holdout sample helps in picking the best-performing model.

A holdout sample is a part of the data you leave out of the model building so it can be
used to evaluate the model afterward.

o The principle here is simple: the model should work on unseen data.
o You use only a fraction of your data to estimate the model and the other part, the holdout
sample, is kept out of the equation.
o The model is then unleashed on the unseen data and error measures are calculated to evaluate it.
o The error measure used in the example is the mean square error.

o Mean square error is a simple measure: check for every prediction how far it
was from the truth, square this error, and average the squared errors over all predictions.

 We use 800 randomly chosen observations out of 1,000 (or 80%), without
showing the other 20% of data to the model.
 Once the model is trained, we predict the values for the other 20% of the
observations, for which we already know the true values, and calculate
the model error with an error measure.
 Then we choose the model with the lowest error.
 In this example we chose model 1 because it has the lowest total error.
 Many models make strong assumptions, such as independence of the inputs, and
we have to verify that these assumptions are indeed met.
 This is called model diagnostics.
 Once we have a working model we are ready to go to the last step.
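
The holdout procedure described in this section can be sketched as follows. The data set is synthetic and the two candidate models (a fitted line and a constant prediction) are hypothetical stand-ins; the point is the 800/200 split and the comparison of mean square error on the unseen part.

```python
import random

random.seed(42)  # fixed seed so the split is reproducible

# Hypothetical data: 1,000 observations of y = 3x + noise (made up for illustration).
xs = [random.uniform(0, 10) for _ in range(1000)]
ys = [3 * x + random.gauss(0, 1) for x in xs]

# Randomly pick 800 observations for training; the other 200 form the holdout sample.
indices = list(range(1000))
random.shuffle(indices)
train_idx, holdout_idx = indices[:800], indices[800:]

def fit_line(idx):
    """Least-squares slope and intercept using only the given observations."""
    n = len(idx)
    mx = sum(xs[i] for i in idx) / n
    my = sum(ys[i] for i in idx) / n
    sxy = sum((xs[i] - mx) * (ys[i] - my) for i in idx)
    sxx = sum((xs[i] - mx) ** 2 for i in idx)
    slope = sxy / sxx
    return slope, my - slope * mx

def mean_square_error(predict, idx):
    """Average squared distance between predictions and true values."""
    return sum((predict(xs[i]) - ys[i]) ** 2 for i in idx) / len(idx)

slope, intercept = fit_line(train_idx)
train_mean_y = sum(ys[i] for i in train_idx) / len(train_idx)

# Two candidate models, both estimated on the training data only.
mse_line = mean_square_error(lambda x: intercept + slope * x, holdout_idx)
mse_constant = mean_square_error(lambda x: train_mean_y, holdout_idx)

# Choose the model with the lowest error on the holdout sample.
best = "line" if mse_line < mse_constant else "constant"
print(mse_line, mse_constant, best)
```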

1.10. PRESENTING FINDINGS AND BUILDING APPLICATIONS (Step 6)

After the data is analyzed and a well-performing model is built, the findings are
ready to be presented to the world.

 We can explain what we have found to the stakeholders.

o Sometimes we need to repeat our work over and over again because stakeholders value the
predictions of the models or the insights that we produced.
o For this reason, it is necessary to automate our models.
o This doesn’t always mean that you have to redo all of your analysis all the time.
o Sometimes it’s sufficient to implement only the model scoring; other times you can build an
application that automatically updates reports, Excel spreadsheets, or PowerPoint
presentations.
o Our soft skills will be most useful at this stage.
