
FDSA Unit1

The document provides an introduction to data science, outlining its need, benefits, and the data science process. It discusses the characteristics and challenges of big data, differentiates between data scientists and statisticians, and highlights the importance of Python in data science. Additionally, it details the various types of data, the steps involved in the data science process, and emphasizes the significance of setting clear research goals and retrieving relevant data.

Uploaded by

gshinymol94



UNIT I

INTRODUCTION TO DATA SCIENCE

Need for data science – benefits and uses – facets of data – data science process – setting the research
goal – retrieving data – cleansing, integrating, and transforming data – exploratory data analysis – build
the models – presenting and building applications.

1.1 NEED FOR DATA SCIENCE

Big data: a blanket term for any collection of data sets so large or complex that it becomes difficult to
process them using traditional data management techniques.
Data science involves using methods to analyze massive amounts of data and extract the knowledge it
contains.

Characteristics of big data

Fig. 5 Vs of Big Data

The characteristics of big data are often referred to as the five Vs:

o Volume: How much data is there?
o Variety: How diverse are the different types of data?
o Velocity: At what speed is new data generated?
o Veracity: How accurate is the data?
o Value: What value does the data provide?
These five properties make big data different from the data found in traditional data
management tools.

Challenges of big data:

• data capture
• curation
• storage
• search
• sharing
• transfer
• visualization
• the need for specialized techniques to extract the insights

Role of Data Science

• It is an evolutionary extension of statistics capable of dealing with the massive amounts of
data produced today.
• It adds methods from computer science to the repertoire of statistics.

Data Scientist vs. Statistician
• The main things that set a data scientist apart from a statistician are the ability to work with
big data and experience in machine learning, computing, and algorithm building.
• Their tools tend to differ too, with data scientist job descriptions more frequently mentioning
the ability to use Hadoop, Pig, Spark, R, Python, and Java, among others.

Why Python?
• Python is a great language for data science because it has many data science libraries
available, and it’s widely supported by specialized software.
• Every popular NoSQL database has a Python-specific API.
• Python makes it possible to prototype quickly.
• Its performance is good enough for most data science work.
• Python’s influence is steadily growing in the data science world.
As the amount of data continues to grow and the need to leverage it becomes more important,
every data scientist will come across big data projects throughout their career.

1.2 BENEFITS AND USES OF DATA SCIENCE

Data science and big data are used almost everywhere in both commercial and non-commercial
settings.
Here are some of the fields where Data Science and Big Data are widely used.

(i) Commercial companies

• Gain insights into their customers, processes, staff, competition, and products.
• Offer customers a better user experience, and cross-sell, up-sell, and
personalize their offerings.
• Example: Google AdSense collects data from internet users so that relevant
commercial messages can be matched to the person browsing the internet.

(ii) Human resource professionals

These professionals use people analytics and text mining to:
• screen candidates
• monitor the mood of employees
• study informal networks among coworkers

(iii) Financial institutions

Financial institutions use data science to:
• predict stock markets
• determine the risk of lending money
• learn how to attract new clients for their services
50% of trades worldwide are performed automatically by machines based on algorithms
developed by quants. Quants are data scientists who work on trading algorithms with the help of
big data and data science techniques.

(iv) Governmental Organizations

• Rely on internal data scientists to discover valuable information, but also share their data
with the public.
• Use the data to gain insights or build data-driven applications.
• A data scientist in a governmental organization gets to work on diverse projects such as
detecting fraud and other criminal activity or optimizing project funding.

(v) Non-governmental organizations
o Non-governmental organizations (NGOs) use data to raise money and defend their
causes.
o The World Wildlife Fund (WWF), for instance, employs data scientists to increase
the effectiveness of its fundraising efforts.
o Many data scientists devote part of their time to helping NGOs, because NGOs
often lack the resources to collect data and employ data scientists.
o DataKind is one such data scientist group that devotes its time to the benefit of
mankind.

(vi) Universities
• Universities use data science in their research but also to enhance the study experience of
their students.
• Massive open online courses (MOOCs) produce a lot of data, which allows universities to
study how this type of learning can complement traditional classes.
• Examples of MOOC providers: Coursera, Udacity, and edX.
• MOOCs allow you to stay up to date by following courses from top universities.

1.3. FACETS OF DATA

In data science and big data there are many different types of data, and each of them tends to
require different tools and techniques.
The main categories of data are these:
■ Structured
■ Unstructured
■ Natural language
■ Machine-generated
■ Graph-based
■ Audio, video, and images
■ Streaming

1.3.1 Structured data

Structured data is data that depends on a data model and resides in a fixed field within a record.
o Structured data is easy to store in tables within databases or Excel files.
o SQL, or Structured Query Language, is the preferred way to manage and query data that
resides in databases.

Figure 1.1 An Excel table is an example of structured data.
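To make the idea of fixed fields concrete, here is a minimal sketch that stores a few records in a table and queries them with SQL through Python's built-in sqlite3 module. The table name, column names, and rows are hypothetical and serve only as an illustration.

```python
import sqlite3

# Hypothetical table and columns, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Ann", 34, "FR"), ("Bob", 51, "US"), ("Cy", 27, "FR")],
)

# Every record has the same fixed fields, so a declarative query is enough.
rows = conn.execute(
    "SELECT name, age FROM people WHERE country = ? ORDER BY age", ("FR",)
).fetchall()
print(rows)  # [('Cy', 27), ('Ann', 34)]
conn.close()
```

Because the structure is known in advance, filtering and sorting need no custom parsing code.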

1.3.2 Unstructured data

The world isn’t made up of structured data, though; structure is imposed upon it by humans and
machines. More often, data comes unstructured. Example: email (figure 1.2).
• Unstructured data is data that isn’t easy to fit into a data model because the content is
context-specific or varying.
• Although email contains structured elements such as the sender, title, and body text, analyzing
the free-form content is a challenge, and the thousands of different languages and dialects out
there further complicate this.

Figure 1.2 Email is an example of unstructured data and natural language data.

1.3.3 Natural language

Natural language is a special type of unstructured data; it’s challenging to process because
it requires knowledge of specific data science techniques and linguistics.
• The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis.
• But models trained in one domain don’t generalize well to other domains.
• Even state-of-the-art techniques aren’t able to decipher the meaning of every piece of text.
• Humans struggle with natural language as well; it’s ambiguous by nature.
• The concept of meaning itself is questionable here.
• The meaning of the same words can vary when coming from someone upset or joyous.

1.3.4 Machine-generated data
Machine-generated data is information that’s automatically created by a computer,
process, application, or other machine without human intervention.
• Machine-generated data is becoming a major data resource and will continue to do so. Wikibon
has published a forecast of the market value of the industrial Internet.
• The analysis of machine data relies on highly scalable tools, due to its high volume and speed.
Examples of machine data are web server logs, call detail records, network event logs, and
telemetry (figure 1.3).

Figure 1.3 Example of machine-generated data

The machine data shown in figure 1.3 fits well in a classic table-structured database. This isn’t the
best approach for highly interconnected or “networked” data, where the relationships between
entities have a valuable role to play.

1.3.5 Graph-based or network data

“Graph” points to mathematical graph theory. In graph theory, a graph is a mathematical
structure to model pair-wise relationships between objects.
• Graph or network data is, in short, data that focuses on the relationship or adjacency of objects.
• Graph structures use nodes, edges, and properties to represent and store graph data.
• Graph-based data is a natural way to represent social networks, and its structure allows you to
calculate specific metrics such as the influence of a person and the shortest path between two
people.
• Examples of graph-based data can be found on many social media websites (figure 1.4).

Figure 1.4 Friends in a social network are an example of graph-based data.

Your follower list on Twitter is another example of graph-based data.
• Graph databases are used to store graph-based data and are queried with specialized query
languages such as SPARQL.
• Graph data poses its challenges, but for a computer, interpreting audio and image data can
be even more difficult.
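The two metrics mentioned for social networks, the shortest path between two people and a person's influence, can be sketched with plain Python. The friendship graph below is hypothetical, breadth-first search is one common way to find a shortest path in an unweighted graph, and counting direct connections is only a crude stand-in for influence.

```python
from collections import deque

# Hypothetical undirected friendship graph stored as an adjacency mapping.
friends = {
    "Ann": ["Bob", "Cy"],
    "Bob": ["Ann", "Dee"],
    "Cy": ["Ann", "Dee"],
    "Dee": ["Bob", "Cy", "Eve"],
    "Eve": ["Dee"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None

path = shortest_path(friends, "Ann", "Eve")
print(path)  # ['Ann', 'Bob', 'Dee', 'Eve']

# Number of direct connections per person (a very simple influence proxy).
degree = {person: len(adjacent) for person, adjacent in friends.items()}
most_connected = max(degree, key=degree.get)
print(most_connected)  # Dee
```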

1.3.6 Audio, image, and video

Audio, image, and video are data types that pose specific challenges to a data scientist.
Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging
for computers.
• High-speed cameras at stadiums capture ball and athlete movements so that analytics can be
calculated in real time.
• A company called DeepMind succeeded at creating an algorithm that’s capable of
learning how to play video games.
• This algorithm takes the video screen as input and learns to interpret everything via a complex
process of deep learning.
• Google bought the company for its own artificial intelligence (AI) development plans.
• The learning algorithm takes in data as it’s produced by the computer game; it’s streaming data.

1.3.7 Streaming data

Streaming data flows into the system when an event happens instead of being loaded into a
data store in a batch.
• Examples are the “What’s trending” section on Twitter, live sporting or music events, and the stock
market.
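The difference between streaming and batch processing can be illustrated with a generator that updates a statistic each time an event arrives, instead of waiting for the full data set. This is a minimal sketch; the price values and the function name running_mean are hypothetical.

```python
def running_mean(events):
    """Yield the mean of all values seen so far, one result per incoming event."""
    count, total = 0, 0.0
    for value in events:
        count += 1
        total += value
        yield total / count

# Hypothetical stream of stock prices arriving one at a time.
prices = iter([10.0, 12.0, 11.0, 15.0])
means = list(running_mean(prices))
print(means)  # [10.0, 11.0, 11.0, 12.0]
```

Only the count and the running total are kept in memory, so the same code works for a stream of any length.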

1.4 THE DATA SCIENCE PROCESS

The data science process typically consists of six steps.

(1) Setting the research goal
Data science is mostly applied in the context of an organization. When
the business asks you to perform a data science project, you’ll first prepare a project
charter.
• The charter contains information such as:
  o what you’re going to research
  o how the company benefits from that
  o what data and resources you need
  o a timetable
  o deliverables

(2) Data Retrieval

In this step the data is collected.

• The charter states which data you need and where you can find it.
• In this step you ensure that you can use the data in your program, which means checking the
existence of, quality of, and access to the data.
• Data can also be delivered by third-party companies and takes many forms, ranging from
Excel spreadsheets to different types of databases.

(3) Data preparation

Data collection is an error-prone process; in this phase you enhance the quality of
the data and prepare it for use in subsequent steps.
It consists of three subphases:
1. Data cleansing: removes false values from a data source and inconsistencies across
data sources.
2. Data integration: enriches data sources by combining information from multiple
data sources.
3. Data transformation: ensures that the data is in a suitable format for use in your
models.

(4) Data exploration

Data exploration is concerned with building a deeper understanding of your data.
We can try to understand how variables interact with each other, the distribution
of the data, and whether there are outliers.
To achieve this you mainly use:
• descriptive statistics
• visual techniques
• simple modeling
This step is called exploratory data analysis (EDA).

(5) Data modeling or model building

In this phase you use models, domain knowledge, and insights about the data found
in the previous steps to answer the research question.
It has the following steps:
• Select a technique from the fields of statistics, machine learning, operations
research, and so on.
• Build the model, which is an iterative process that involves selecting the variables for the
model, executing the model, and model diagnostics.

(6) Presentation and automation
Finally, you present the results to your business. These results can take many
forms, ranging from presentations to research reports.
Sometimes you’ll need to automate the execution of the process because the
business will want to use the insights you gained in another project or enable an
operational process to use the outcome from your model.

An Iterative Process
o The data science process is not always linear.
o In reality we often have to step back and rework certain findings.
o We might gain incremental insights, which may lead to new questions.
o To prevent rework, make sure that you scope the business question
clearly and thoroughly at the start.

Fig. Overview of Data Science process

1.5 SETTING THE RESEARCH GOAL (Step 1)
A project starts by understanding the what, the why, and the how of the project.

• Answering these questions (what, why, how) is the goal of the first phase, so that everybody
knows what to do and can agree on the best course of action.
• The outcome should be:
  o a clear research goal
  o a good understanding of the context
  o well-defined deliverables
  o a plan of action with a timetable
• This information is then best placed in a project charter.
• The length and formality can, of course, differ between projects and companies.
• In this early phase of the project, people skills and business acumen are more
important than great technical prowess.
• This phase is often guided by more senior personnel.
1.5.1 Spend time understanding the goals and context of your research
An essential outcome is the research goal, which states the purpose of your
assignment in a clear and focused manner.
• Understanding the business goals and context is critical for project success.
• Continue asking questions and devising examples until you grasp the exact business
expectations.
• Identify how your project fits in the bigger picture, and appreciate how your research is
going to change the business.
• Understand how the business will use your results.
• Many data scientists fail here: despite their mathematical wit and scientific brilliance, they
never seem to grasp the business goals and context.

1.5.2 Create a project charter

Clients like to know upfront what they’re paying for.
o After you have a good understanding of the business problem, try to get a formal
agreement on the deliverables.
o All this information is best collected in a project charter.
o For any significant project this would be mandatory.
o A project charter requires teamwork, and your input covers at least the following:
  • A clear research goal
  • The project mission and context
  • How you’re going to perform your analysis
  • What resources you expect to use
  • Proof that it’s an achievable project, or proof of concepts
  • Deliverables and a measure of success
  • A timeline
The client can use this information to make an estimation of the project costs and the
data and people required for the project to become a success.

1.6 RETRIEVING DATA (Step 2)

The second step in data science is to retrieve the required data (figure 2.3).
Many companies will have already collected and stored the data, and what they don’t
have can often be bought from third parties.
It is also worth looking outside your organization for data, because more and more organizations
are making even high-quality data freely available for public and commercial use.

• Data can be stored in many forms, ranging from simple text files to tables in a database.
• The objective now is acquiring all the data you need.
• Data is often like a diamond in the rough: it needs polishing to be of any use to you.

1.6.1 Start with data stored within the company

• Your first act should be to assess the relevance and quality of the data that’s readily available within
your company.
• Most companies have a program for maintaining key data, so much of the cleaning work may
already be done.
• This data can be stored in official data repositories that are maintained by a team of IT
professionals, such as:
o Databases: designed for data storage.
o Data marts: subsets of the data warehouse, geared toward serving a specific business
unit.
o Data warehouses: designed for reading and analyzing that data.
o Data lakes: contain data in its natural or raw format.
Data warehouses and data marts are home to preprocessed data, whereas data lakes hold raw data.
Challenges in finding data in your own company:
• As companies grow, their data becomes scattered around many places.
• Knowledge of the data may be dispersed as people change positions and leave the company.
• Documentation and metadata aren’t always the top priority of a delivery manager.
Chinese Walls:
• Organizations understand the value and sensitivity of data and often have policies in place so
everyone has access to what they need and nothing more.
• These policies translate into physical and digital barriers called Chinese walls.
• These “walls” are mandatory and well-regulated for customer data in most countries.
• This is for good reasons, too; imagine everybody in a credit card company having access to
your spending habits.
• Getting access to the data may take time and involve company politics.

1.6.2 Searching data outside organization

Many companies specialize in collecting valuable information. E.g., Nielsen and GfK are well
known for this in the retail industry.
o Other companies provide data so that you, in turn, can enrich their services and ecosystem.
Such is the case with Twitter, LinkedIn, and Facebook.
o Governments and organizations share their data for free with the world.
o This data can be of excellent quality; it depends on the institution that creates and manages it.
o E.g., the information they share can be on the number of accidents or the amount of drug abuse and its
demographics.
This data is helpful when you want to enrich proprietary data but also convenient when training
your data science skills at home.

Open data site   Description

[Link]           The home of the US Government’s open data
[Link]           An open database that retrieves its information from sites
                 like Wikipedia, MusicBrainz, and the SEC archive
[Link]           Open data initiative from the World Bank
[Link]           Open data for international development
[Link]           Open data from the US Food and Drug Administration

Table 2.1 A list of open-data providers that should get you started

Data Quality Checks

o Most of the errors you’ll encounter during the data-gathering phase are easy to spot, but
being too careless will make you spend many hours solving data issues that could have been prevented
during data import.
o Investigate the data during the import, data preparation, and exploratory phases.
o During data retrieval, you check to see if the data is equal to the data in the source document
and look to see if you have the right data types.
  • When you have enough evidence that the data is similar to the data you find in the source
  document, you stop.
o During data preparation, you do a more elaborate check.
  • If you did a good job during the previous phase, the errors you find now are also
  present in the source document.
  • The focus is on the content of the variables: you want to get rid of typos and other data entry
  errors and bring the data to a common standard among the data sets.
o During the exploratory phase your focus shifts to what you can learn from the data.
  • Assume the data to be clean and look at the statistical properties such as distributions,
  correlations, and outliers.
o You’ll often iterate over these phases.
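The retrieval-time checks described above (does the imported data match the source, and are the data types right) might look like the following sketch. The CSV contents, the column names, and the expected row count are hypothetical; in practice the data would come from a file and the expected count from the source document.

```python
import csv
import io

# Hypothetical source contents; normally this would be a file on disk.
source = io.StringIO("id,age,country\n1,34,FR\n2,abc,US\n3,27,FR\n")
rows = list(csv.DictReader(source))

# Check 1: did every record arrive?
expected_rows = 3  # assumed to be known from the source document
row_count_ok = len(rows) == expected_rows

# Check 2: does each value have the expected data type?
def is_int(text):
    try:
        int(text)
        return True
    except ValueError:
        return False

bad_age_ids = [row["id"] for row in rows if not is_int(row["age"])]
print(row_count_ok, bad_age_ids)  # True ['2']
```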

1.7 CLEANSING, INTEGRATING, AND TRANSFORMING DATA (Step 3)
• The data received from the data retrieval phase is likely to be “a diamond in the rough.”
• The task now is to sanitize and prepare it for use in the modeling and reporting phase.
• Doing so is tremendously important because models will perform better and we’ll lose less
time trying to fix strange output.
• Your model needs the data in a specific format, so data transformation will always come into
play.
• It’s a good habit to correct data errors as early in the process as possible. This isn’t always
possible, but the effort should still be made.
1.7.1 Cleansing data
Data cleansing is a subprocess of the data science process that focuses on removing errors in
your data so your data becomes a true and consistent representation of the processes it originates from.
By “true and consistent representation” we imply that at least two types of errors exist:
1. Interpretation errors:
These occur when you take the value in your data for granted, such as accepting that a person’s age is
greater than 300 years.
2. Inconsistencies between data sources or against your company’s standardized values:
An example of this class of errors is putting “Female” in one table and “F” in another when they
represent the same thing: that the person is female.
More advanced error-detection methods exist, but at the data cleansing stage they are rarely
applied and are often regarded by certain data scientists as overkill.

ERRORS
I. DATA ENTRY ERRORS
Data collection and data entry are error-prone processes.
a) Human intervention
Humans can make typos or lose their concentration for a second and introduce an
error into the chain.
b) Machine/Computers
• Data collected by machines or computers isn’t free from errors either.
• Some errors arise from human sloppiness, whereas others are due to machine or
hardware failure.
• Examples of errors originating from machines are transmission errors or bugs in the
extract, transform, and load (ETL) phase.
• For small data sets you can check every value by hand.
• When the variables you study don’t have many classes, you can detect data errors
by tabulating the data with counts.
• When you have a variable that can take only two values, “Good” and “Bad”, you
can create a frequency table and see if those are truly the only two values present.
• In table 2.3, the values “Godo” and “Bade” point out that something went wrong in at
least 16 cases.
Table 2.3 Detecting outliers on simple variables with a frequency table
Value    Count
Good   1598647
Bad    1354468
Godo        15
Bade         1
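A frequency table like table 2.3 can be produced with collections.Counter from the Python standard library. The sketch below uses a small hypothetical list of values rather than the counts from the table.

```python
from collections import Counter

# Hypothetical column that should contain only "Good" or "Bad".
values = ["Good", "Bad", "Good", "Godo", "Bad", "Good", "Bade"]
allowed = {"Good", "Bad"}

# Tabulate the data with counts.
counts = Counter(values)
for value, count in counts.most_common():
    print(f"{value:<6}{count:>4}")

# Anything outside the allowed classes points to a data entry error.
unexpected = {value: count for value, count in counts.items() if value not in allowed}
print(unexpected)  # {'Godo': 1, 'Bade': 1}
```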
II. REDUNDANT WHITESPACE
• Whitespaces tend to be hard to detect but cause errors like other redundant characters
would.
• For example, a whitespace at the end of a string can cause a bug that is difficult to
find: after looking for days through the code, you finally find it, and then comes the hardest
part, explaining the delay to the project stakeholders.
• In this example, the cleaning during the ETL (extract, transform, and load) phase wasn’t well
executed, and keys in one table contained a whitespace at the end of a string.
• This caused a mismatch of keys such as “FR ” – “FR”, dropping the observations that
couldn’t be matched.
• If you know to watch out for them, fixing redundant whitespaces is luckily easy enough in
most programming languages.
• They all provide string functions that will remove the leading and trailing whitespaces.
• For instance, in Python you can use the strip() function to remove leading and trailing spaces.

Fixing Capital Letter Mismatches

Capital letter mismatches are common.
o Most programming languages make a distinction between “Brazil” and “brazil”.
o In this case you can solve the problem by applying a function that returns both strings in
lowercase, such as .lower() in Python. "Brazil".lower() == "brazil".lower() should result in True.
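The whitespace and capitalization fixes can be combined into one small normalization step applied to keys before matching. This is a minimal sketch; the key values and the helper name normalize are hypothetical.

```python
def normalize(key):
    """Remove leading and trailing whitespace and ignore case."""
    return key.strip().lower()

# Hypothetical keys from two tables that are supposed to match.
table_a_keys = ["FR ", "Brazil", " us"]
table_b_keys = ["FR", "brazil", "US"]

# Without cleaning, no key matches at all.
raw_matches = set(table_a_keys) & set(table_b_keys)
print(raw_matches)  # set()

# After normalization, all three keys match.
clean_matches = {normalize(k) for k in table_a_keys} & {normalize(k) for k in table_b_keys}
print(sorted(clean_matches))  # ['brazil', 'fr', 'us']
```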

III. IMPOSSIBLE VALUES AND SANITY CHECKS

Check the value against physically or theoretically impossible values such as people
taller than 3 meters or someone with an age of 299 years. Sanity checks can be directly expressed
with rules:
check = 0 <= age <= 120
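Applied to a set of records, the rule above becomes a simple filter. The records below are hypothetical; the bounds of 0 and 120 are taken from the rule in the text.

```python
# Hypothetical records to validate.
people = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": 299},
    {"name": "Cy", "age": -1},
]

def age_is_plausible(age):
    """Sanity check: the same 0 <= age <= 120 rule as in the text."""
    return 0 <= age <= 120

suspect = [person["name"] for person in people if not age_is_plausible(person["age"])]
print(suspect)  # ['Bob', 'Cy']
```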

IV. OUTLIERS
An outlier is an observation that seems to be distant from other observations or,
more specifically, one observation that follows a different logic or generative process
than the other observations.
• Outliers may be exceptions that stand outside individual samples of populations.
• An outlier is a data point that is noticeably different from the rest.
• The easiest way to find outliers is to use a plot or a table with the minimum and
maximum values. An example is shown in figure 2.6.
• The normal distribution, or Gaussian distribution, is the most common
distribution in natural sciences.

Which technique to use at what time is dependent on your particular case.
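A quick look at the minimum and maximum is often enough to spot an outlier. The sketch below adds one further check, the 1.5 × IQR rule of thumb, which is an assumption of this example and not a method named in the text. The measurements are hypothetical.

```python
import statistics

# Hypothetical height measurements (cm) with one suspicious value.
heights_cm = [168, 172, 175, 169, 181, 177, 170, 174, 1730, 173]

# Step 1: a table of minimum and maximum values.
print(min(heights_cm), max(heights_cm))  # 168 1730

# Step 2 (assumed rule of thumb): flag points more than 1.5 * IQR
# below the first quartile or above the third quartile.
q1, _, q3 = statistics.quantiles(heights_cm, n=4)
iqr = q3 - q1
outliers = [h for h in heights_cm if h < q1 - 1.5 * iqr or h > q3 + 1.5 * iqr]
print(outliers)  # [1730]
```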

V. DEVIATIONS FROM A CODE BOOK

• Detecting errors in larger data sets against a code book or against standardized values
can be done with the help of set operations.
• A code book is a description of your data, a form of metadata.
o It contains things such as the number of variables per observation, the
number of observations, and what each encoding within a variable means.
o A code book also tells the type of data you’re looking at: is it hierarchical,
graph, something else?
• If you have multiple values to check, it’s better to put them from the code book into a
table and use a difference operator to check the discrepancy between both tables.
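The difference operation can be sketched with Python sets. The code book entries and the observed values below are hypothetical.

```python
# Hypothetical code book: the allowed encodings for a "country" variable.
code_book = {"FR", "US", "BR", "DE"}

# Hypothetical values observed in the data set.
observed = ["FR", "US", "UK", "BR", "fr", "US"]

# Set difference: every observed value that the code book does not define.
deviations = set(observed) - code_book
print(sorted(deviations))  # ['UK', 'fr']
```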

VI. DIFFERENT UNITS OF MEASUREMENT

When integrating two data sets, you have to pay attention to their respective units of
measurement.
An example of this would be when you study the prices of gasoline in the world. To do this
you gather data from different data providers. Some data sets contain prices per gallon and others
contain prices per liter. A simple conversion will do the trick in this case.
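The gallon-to-liter conversion could be sketched as follows. The price records are hypothetical, the gallon is assumed to be the US liquid gallon, and all prices are assumed to be in the same currency already.

```python
LITERS_PER_US_GALLON = 3.785411784  # definition of the US liquid gallon

# Hypothetical price records from two providers, in different units.
prices = [
    {"country": "US", "price": 3.50, "unit": "gallon"},
    {"country": "FR", "price": 1.80, "unit": "liter"},
]

def price_per_liter(record):
    """Bring every price to the common unit: price per liter."""
    if record["unit"] == "gallon":
        return record["price"] / LITERS_PER_US_GALLON
    return record["price"]

normalized = {r["country"]: round(price_per_liter(r), 3) for r in prices}
print(normalized)
```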

VII. DIFFERENT LEVELS OF AGGREGATION

• Having different levels of aggregation is similar to having different types of
measurement.
• An example of this would be a data set containing data per week versus one containing
data per work week.
• This type of error is generally easy to detect, and summarizing the data sets will fix it.
After cleaning the data errors, you combine information from different data sources.

1.7.2 Correct errors as early as possible

A good practice is to remedy data errors as early as possible in the data collection
chain and to fix as little as possible inside your program while fixing the origin of the problem.
• The data collection process is error-prone, and in a big organization it involves many
steps and teams.
• Data should be cleansed when acquired for many reasons:
■ Not everyone spots the data anomalies.
■ If errors are not corrected early on in the process, the cleansing will have to be done for
every project that uses that data.
■ Data errors may point to:
o a business process that isn’t working as designed
o defective equipment, such as broken transmission lines and defective sensors
o bugs in software or in the integration of software that may be critical to the
company
• As a final remark: always keep a copy of your original data.

1.7.3 Combining data from different data sources

Data comes from several different places, and in this substep we focus on integrating these
different sources.
Data varies in size, type, and structure, ranging from databases and Excel files to text
documents.
The Different Ways Of Combining Data
There are two operations to combine information from different data sets.
1. Joining: enriching an observation from one table with information from another
table.
2. Appending or stacking: adding the observations of one table to those of another table.

• When you combine data, you have the option to create a new physical table or
a virtual table by creating a view.
• The advantage of a view is that it doesn’t consume more disk space.

• Joining tables
Joining tables allows you to combine the information of one observation found in one table
with the information that you find in another table.

To join tables, you use variables that represent the same object in both tables,
such as a date, a country name, or a Social Security number.
These common fields are known as keys.
When these keys also uniquely define the records in the table they are called primary
keys.
• Appending tables
Appending or stacking tables is effectively adding observations from one
table to another table.
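Both operations can be sketched in plain Python with tables represented as lists of dictionaries. The table contents and the key name client_id are hypothetical; in practice a database or a data-frame library would usually do this work.

```python
# Hypothetical tables represented as lists of dicts.
clients = [{"client_id": 1, "name": "Ann"}, {"client_id": 2, "name": "Bob"}]
regions = [{"client_id": 1, "region": "EU"}, {"client_id": 2, "region": "NA"}]

# Joining: enrich each client with the columns of the matching region record.
# client_id is the key; it is unique in both tables, so it acts as a primary key.
region_by_id = {r["client_id"]: r for r in regions}
joined = [
    {**c, **region_by_id[c["client_id"]]}
    for c in clients
    if c["client_id"] in region_by_id
]
print(joined)

# Appending (stacking): same columns, more observations.
january = [{"client_id": 1, "sales": 100}]
february = [{"client_id": 1, "sales": 120}, {"client_id": 2, "sales": 80}]
appended = january + february
print(len(appended))  # 3
```

A join adds columns to existing observations, while an append adds observations with the same columns.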

USING VIEWS TO SIMULATE DATA JOINS AND APPENDS

To avoid duplication of data, you virtually combine data with views.
A view behaves as if you’re working on a table, but this table is nothing but a virtual layer
that combines the tables for you.
Drawback:
While a table join is only performed once, the join that creates the view is recreated every
time it’s queried, using more processing power than a pre-calculated table would have.
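A view that virtually appends two tables can be sketched with Python's built-in sqlite3 module. The table and view names are hypothetical. The view stores only the query, so it is re-evaluated on every read, which is the processing cost mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical monthly tables with identical columns.
conn.executescript("""
    CREATE TABLE sales_jan (client_id INTEGER, amount REAL);
    CREATE TABLE sales_feb (client_id INTEGER, amount REAL);
    INSERT INTO sales_jan VALUES (1, 100.0);
    INSERT INTO sales_feb VALUES (1, 120.0), (2, 80.0);
    -- The view stores the query, not a copy of the rows.
    CREATE VIEW sales_all AS
        SELECT * FROM sales_jan
        UNION ALL
        SELECT * FROM sales_feb;
""")

total = conn.execute("SELECT SUM(amount) FROM sales_all").fetchone()[0]
print(total)  # 300.0

# A new row in a base table shows up in the view without rebuilding anything.
conn.execute("INSERT INTO sales_jan VALUES (3, 50.0)")
total_after = conn.execute("SELECT SUM(amount) FROM sales_all").fetchone()[0]
print(total_after)  # 350.0
conn.close()
```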

ENRICHING AGGREGATED MEASURES
Data enrichment can also be done by adding calculated information to the table.
Extra measures, such as each row’s percentage of a total, can add perspective.

Transforming Data
Certain models require their data to be in a certain shape.
o Now that the data has been cleansed and integrated, this is the next task we can
perform: transforming the data so it takes a suitable form for data modeling.
o Relationships between an input variable and an output variable aren’t always linear.
o Take, for instance, a relationship of the form y = a·e^(bx).
o Taking the log of the dependent variable y gives ln(y) = ln(a) + bx, which is linear in x and
simplifies the estimation problem dramatically.
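For a relationship of the form y = a·e^(bx), taking the natural log of y gives ln(y) = ln(a) + b·x, a straight line in x that ordinary least squares can fit. The sketch below checks this on hypothetical noise-free data generated with a = 2 and b = 0.5; with real data the recovered values would only be estimates.

```python
import math

# Hypothetical data generated from y = a * e^(b*x) with a = 2 and b = 0.5.
xs = [0, 1, 2, 3, 4]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# Transform: ln(y) = ln(a) + b*x is linear in x.
log_ys = [math.log(y) for y in ys]

# Ordinary least squares for a straight line, written out by hand.
n = len(xs)
mean_x = sum(xs) / n
mean_ly = sum(log_ys) / n
b = sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(xs, log_ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = math.exp(mean_ly - b * mean_x)
print(round(a, 6), round(b, 6))  # 2.0 0.5
```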

REDUCING THE NUMBER OF VARIABLES

Having too many variables in your model makes the model difficult to handle, and certain
techniques don’t perform well when you overload them with too many input variables.
• Data scientists use special methods to reduce the number of variables but retain
the maximum amount of information.
• Reducing the number of variables makes it easier to understand the key values.

TURNING VARIABLES INTO DUMMIES

- Variables can be turned into dummy variables.
- Dummy variables can only take two values: true(1) or false(0).
- They’re used to indicate the presence or absence of a categorical effect that may explain the
observation.
- In this case you’ll make separate columns for the classes stored in one variable and
indicate it with 1 if the class is present and 0 otherwise.
- An example is turning one column named Weekdays into the columns Monday through
Sunday.
- You use an indicator to show if the observation was on a Monday; you put 1 on Monday
and 0 elsewhere.
- Turning variables into dummies is a technique that’s used in modeling and is popular
with, but not exclusive to, economists.
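The weekday example can be sketched in plain Python. The observations are hypothetical, and the column name weekday is an assumption of this example.

```python
# Hypothetical observations with one categorical column.
observations = [{"weekday": "Monday"}, {"weekday": "Sunday"}, {"weekday": "Monday"}]
weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# One 0/1 column per class: 1 where the class is present, 0 elsewhere.
dummies = [
    {day: int(obs["weekday"] == day) for day in weekdays} for obs in observations
]
print(dummies[0]["Monday"], dummies[0]["Sunday"])  # 1 0
print(dummies[1]["Sunday"])  # 1
```

Each observation ends up with exactly one 1 across the seven new columns.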

1.8. EXPLORATORY DATA ANALYSIS (Step 4)

o During exploratory data analysis you take a deep dive into the data.
o Information becomes much easier to grasp when shown in a picture, therefore you mainly use
graphical techniques to gain an understanding of your data and the interactions between variables.
o Some anomalies may have been missed in earlier steps, forcing you to take a step back and fix them.
o The visualization techniques you use in this phase range from simple line graphs or
histograms to more complex diagrams such as Sankey and network graphs.
o Sometimes it’s useful to compose a composite graph from simple graphs to get even
more insight into the data.
o Other times the graphs can be animated or made interactive to make exploration easier.
Bar Chart:
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular
bars with heights or lengths proportional to the values that they represent.

Fig. Bar Chart

Distribution plot:
Distribution plots visually assess the distribution of sample data by comparing
the empirical distribution of the data with the theoretical values expected from a specified
distribution.

Fig. Distribution plot

Line plot:
A line plot, also called a dot plot, is a graph that shows the frequency, or the number of
times, a value occurs in a data set.

Fig. Line plot

Multiple plots:
These plots can be combined to provide even more insight.

Fig. Multiple plots

A Pareto diagram:
A Pareto chart is a type of chart that contains both bars and a line graph, where individual values are
represented in descending order by bars, and the cumulative total is represented by the line.

Fig. Pareto diagram
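The two ingredients of a Pareto chart, the descending bars and the cumulative line, can be computed as follows. This is a minimal sketch using only the Python standard library; the categories and counts are hypothetical.

```python
from itertools import accumulate

# Hypothetical complaint counts per category (illustrative data only).
counts = {"Wrong item": 15, "Late delivery": 45, "Other": 10, "Damaged item": 30}

# Bars: categories sorted by value in descending order.
bars = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Line: running total expressed as a percentage of the grand total.
total = sum(counts.values())
cumulative_pct = [100 * running / total for running in accumulate(v for _, v in bars)]

for (category, value), pct in zip(bars, cumulative_pct):
    print(f"{category:<14} {value:>3} {pct:6.1f}%")
```

The printed table lists the largest category first, and the last cumulative value is always 100%.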

Brushing and Linking:
With brushing and linking you combine and link different graphs and tables (or views) so
changes in one graph are automatically transferred to the other graphs.

Histogram:
In a histogram, a variable is cut into discrete categories (bins) and the number of occurrences
in each category is counted.

Boxplot:
The boxplot, on the other hand, doesn’t show how many observations are present but does
offer an impression of the distribution within categories. It can show the maximum, minimum,
median, and other characterizing measures at the same time.
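The numbers behind a histogram and a boxplot can be computed without drawing anything. This is a minimal sketch with the Python standard library; the measurements and the bin width of 5 are assumptions chosen for illustration.

```python
import statistics
from collections import Counter

# Hypothetical measurements (illustrative data only).
values = [2, 3, 3, 5, 7, 8, 8, 9, 12, 13, 14, 18]

# Histogram: cut the variable into bins of width 5 and count occurrences per bin.
bin_width = 5
histogram = Counter((v // bin_width) * bin_width for v in values)
for start in sorted(histogram):
    print(f"[{start:>2}, {start + bin_width:>2}) {'#' * histogram[start]}")

# Boxplot: the characterizing measures that a box-and-whisker plot displays.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
summary = {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}
print(summary)
```

The histogram shows how many observations fall in each bin. The five summary values describe the spread without revealing the counts, which matches the contrast drawn above.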

Sankey Diagram:
A Sankey diagram is a visualization used to depict a flow from one set of values to another.
The things being connected are called nodes and the connections are called links.

Fig. Sankey Diagram

Network Graph:
Network diagrams (also called graphs) show interconnections between a set of entities.
Each entity is represented by a node (or vertex). Connections between nodes are represented
through links (or edges).

Fig. Network Graph

1.9. BUILD THE MODELS (Step 5)
With clean data in place and a good understanding of the content, we can start to build
models with the goal of making better predictions, classifying objects, or gaining an
understanding of the system that we are modeling.
This phase is much more focused than the exploratory analysis step, because we know
what we’re looking for and what the outcome should be.

o The techniques used are borrowed from the fields of machine learning, data mining,
and/or statistics.
o Building a model is an iterative process.
o The way you build your model depends on whether you go with classic statistics or
the somewhat more recent machine learning school, and the type of technique to be
used.
Either way, most models consist of the following main steps:
a. Model and variable selection
b. Model execution
c. Model diagnostics and model comparison

1.9.1 Model and variable selection

You need to select the variables to include in your model and a modeling technique.
 The findings from the exploratory analysis should already give a fair idea of what variables will
help you construct a good model.
 Many modeling techniques are available, and choosing the right model for a problem requires
judgment on your part.
Consider model performance and whether the project meets all the requirements to use the model, as
well as other factors:
■ Must the model be moved to a production environment and, if so, would it be easy to
implement?
■ How difficult is the maintenance on the model: how long will it remain relevant if left
untouched?
1.9.2 Model execution
Once the model is chosen, we need to implement it in code.
 Python has libraries such as StatsModels and Scikit-learn for doing this.
These packages implement several of the most popular techniques.
 Coding a model is a nontrivial task in most cases, so having these
libraries available can speed up the process.
The following listing shows the execution of a linear prediction model.

Calling summary() on the fitted model outputs the results table.

■ Model fit:
 For this the R-squared or adjusted R-squared is used.
 This measure is an indication of the amount of variation in the
data that gets captured by the model.
 The difference between the adjusted R-squared and the R-squared
is minimal here; the adjusted one is the normal R-squared plus a
penalty for model complexity.

 A model gets complex when many variables (or features) are
introduced.
 You don’t need a complex model if a simple model is available, so
the adjusted R-squared punishes you for overcomplicating.
 In business, a model fit above 0.85 is often considered
good; a fit in the high 0.90s is very good.
 In research, however, very low model fits (even below 0.2) are
often found.
■ Predictor variables have a coefficient:
 For a linear model this is easy to interpret. In our example if you add
“1” to x1, it will change y by “0.7658”.
 Coefficients are great, but sometimes not enough evidence exists to
show that the influence is there.
■ Predictor significance:
 If, for instance, you determine that a certain gene is significant
as a cause for cancer, this is important knowledge.
 Detecting influences is more important in scientific studies than
perfectly fitting models.
 But when do we know a gene has that impact?
 This is called significance.
 This is what the p-value is about: a small p-value (commonly below 0.05)
indicates that the observed influence is unlikely to be due to chance alone.
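The relationship between R-squared and adjusted R-squared described under "Model fit" can be written out directly. This is a minimal sketch that assumes the common textbook formula for a model with an intercept; the data values are hypothetical.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions capture."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_observations, n_predictors):
    """R-squared with a penalty that grows with the number of predictors."""
    return 1 - (1 - r2) * (n_observations - 1) / (n_observations - n_predictors - 1)


# Hypothetical true values and model predictions (illustrative data only).
actual = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.8, 5.3, 6.9, 9.4, 10.6]

r2 = r_squared(actual, predicted)
print(round(r2, 4))
print(round(adjusted_r_squared(r2, len(actual), n_predictors=1), 4))
print(round(adjusted_r_squared(r2, len(actual), n_predictors=3), 4))
```

For the same R-squared, the adjusted value drops as more predictors are used, which is the penalty for overcomplicating mentioned above.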
k-nearest neighbors
Linear regression works if you want to predict a value, but what if you want to classify something? Then
you turn to classification models, the best known among them being k-nearest neighbors.
k-nearest neighbors looks at labeled points near an unlabeled point and,
based on this, makes a prediction of what the label should be.
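The idea can be sketched in a few lines of plain Python: sort the labeled points by distance to the unlabeled point and let the k closest ones vote. The function name knn_predict, the points, and the choice k=3 are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def knn_predict(labeled_points, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest labeled points."""
    by_distance = sorted(labeled_points, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical labeled points: (coordinates, label).
labeled_points = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_predict(labeled_points, (1.2, 1.4)))
print(knn_predict(labeled_points, (6.4, 6.1)))
```

An odd k avoids tied votes when there are two classes. Library implementations such as the one in Scikit-learn add faster neighbor search on top of the same principle.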

The classifier’s score() method returns the model accuracy, but by “scoring a model” we often mean
applying it to data to make a prediction. With the fitted classifier stored in a variable (named knn here):

prediction = knn.predict(predictors)

Now we can compare the prediction to the real thing using a confusion
matrix.

metrics.confusion_matrix(target,prediction)

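Confusion matrix:

It is a table that is used to describe the performance of a classification algorithm by counting,
for every actual class, how often each class was predicted. Correct predictions lie on the main diagonal.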
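The scoring, prediction, and confusion matrix steps can be sketched end to end with Scikit-learn. The synthetic data, the seed, the variable names, and the choice of 10 neighbors are assumptions made for illustration.

```python
import numpy as np
from sklearn import metrics, neighbors

rng = np.random.default_rng(seed=0)

# Synthetic data (an assumption for illustration): two predictors and a
# class label obtained by rounding a noisy linear combination of them.
predictors = rng.random((500, 2))
target = np.around(predictors @ np.array([0.4, 0.6]) + rng.random(500))

# Fit a k-nearest neighbors classifier that votes among 10 neighbors.
knn = neighbors.KNeighborsClassifier(n_neighbors=10)
knn.fit(predictors, target)

accuracy = knn.score(predictors, target)   # share of correct predictions
prediction = knn.predict(predictors)       # predicted label per observation

# Rows are the true classes, columns the predicted classes.
matrix = metrics.confusion_matrix(target, prediction)
print(accuracy)
print(matrix)
```

The diagonal of the matrix counts the correct predictions, so its sum divided by the number of observations equals the accuracy. Scoring on the training data, as done in this sketch, gives an optimistic figure; section 1.9.3 covers evaluation on a holdout sample.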

 It’s fairly easy to use models that are available in R within Python with the help of the RPy
library.
 RPy provides an interface from Python to R.
 R is a free software environment, widely used for statistical computing.

1.9.3 Model diagnostics and model comparison

We will be building multiple models from which we can choose the best one based on
multiple criteria. Working with a holdout sample helps in picking the best-performing model.

A holdout sample is a part of the data you leave out of the model building so it can be
used to evaluate the model afterward.

o The principle here is simple: the model should work on unseen data.
o You use only a fraction of your data to estimate the model and the other part, the holdout
sample, is kept out of the equation.
o The model is then unleashed on the unseen data and error measures are calculated to evaluate it.
o The error measure used in the example is the mean square error.

o Mean square error is a simple measure: check for every prediction how far it
was from the truth, square this error, and average the squared errors over all predictions.

 We use 800 randomly chosen observations out of 1,000 (or 80%) to train the model, without
showing the other 20% of the data to it.
 Once the model is trained, we predict the target for the other 20% of the
observations, for which we already know the true values, and calculate
the model error with an error measure.
 Then we choose the model with the lowest error.
 In this example we chose model 1 because it has the lowest total error.
 Many models make strong assumptions, such as independence of the inputs, and
we have to verify that these assumptions are indeed met.
 This is called model diagnostics.
 Once we have a working model we are ready to go to the last step.
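The holdout procedure from section 1.9.3 can be sketched with the Python standard library alone. The synthetic data and the two candidate models are hypothetical; they only show how models are compared on unseen data.

```python
import random


def mean_square_error(actual, predicted):
    """Average of the squared differences between predictions and true values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


random.seed(0)

# Synthetic data (an assumption for illustration): y is roughly 2 * x plus noise.
xs = [random.uniform(0, 10) for _ in range(1000)]
data = [(x, 2 * x + random.gauss(0, 1)) for x in xs]

# Hold out 20% at random: 800 observations for training, 200 kept unseen.
random.shuffle(data)
train, holdout = data[:800], data[800:]

# Model 1: least-squares line through the origin, estimated on the training part only.
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
# Model 2: a naive baseline that always predicts the training mean of y.
mean_y = sum(y for _, y in train) / len(train)

actual = [y for _, y in holdout]
errors = {
    "model 1": mean_square_error(actual, [slope * x for x, _ in holdout]),
    "model 2": mean_square_error(actual, [mean_y for _ in holdout]),
}
best = min(errors, key=errors.get)
print(errors, best)
```

Both models are estimated on the 800 training observations only, and the model with the lower error on the 200 holdout observations is selected.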

1.10. PRESENTING FINDINGS AND BUILDING APPLICATIONS

After the data is analyzed and a well-performing model is built, the findings are
ready to be presented to the world.

 We can explain what we have found to the stakeholders.

o Sometimes we need to repeat our work over and over again because stakeholders value the predictions of
the models or the insights that we produced.
o For this reason, it is necessary to automate our models.
o This doesn’t always mean that you have to redo all of your analysis all the time.
o Sometimes it’s sufficient to implement only the model scoring; other times you can build an
application that automatically updates reports, Excel spreadsheets, or PowerPoint
presentations.
o Our soft skills will be most useful at this stage.
