Data Science - Mass-With Question Bank-3cs
Unit I:
Introduction: Benefits and uses - Facets of data - Data science process - Big data ecosystem
and data science.
Unit II:
The Data science process: Overview - research goals - retrieving data - transformation -
Exploratory Data Analysis - Model building.
Unit III:
Algorithms: Machine learning algorithms - Modeling process - Types - Supervised -
Unsupervised - Semi-supervised.
Unit IV:
Introduction to Hadoop: Hadoop framework - Spark replacing MapReduce - NoSQL -
ACID - CAP - BASE - types.
Unit V:
Case Study: Prediction of Disease - Setting research goals - Data retrieval - preparation -
exploration - Disease profiling - presentation and automation.
UNIT I
Introduction: Benefits and uses - Facets of data - Data science process - Big data ecosystem
and data science.
INTRODUCTION:
What is Big data and Data science?
Big data is a blanket term for any collection of data sets so large or complex that it
becomes difficult to process them using traditional data management techniques such as, for
example, the RDBMS (relational database management systems). The widely adopted
RDBMS has long been regarded as a one-size-fits-all solution, but the demands of handling
big data have shown otherwise. Data science involves using methods to analyze massive
amounts of data and extract the knowledge it contains. You can think of the relationship
between big data and data science as being like the relationship between crude oil and an oil
refinery. Data science and big data evolved from statistics and traditional data management
but are now considered to be distinct disciplines.
Data science and big data are used almost everywhere in both commercial and
noncommercial settings. The number of use cases is vast.
1. Commercial companies in almost every industry use data science and big data to gain
insights into their customers, processes, staff, competition, and products. Many companies use
data science to offer customers a better user experience, as well as to cross-sell, up-sell, and
personalize their offerings.
A good example of this is Google AdSense, which collects data from internet users so
relevant commercial messages can be matched to the person browsing the internet. MaxPoint
([Link]
2. Governmental organizations not only rely on internal data scientists to discover valuable information, but
also share their data with the public. You can use this data to gain insights or build data-
driven applications. Example: [Link] is but one
3. Nongovernmental organizations (NGOs) are also no strangers to using data. They use it
to raise money and defend their causes. The World Wildlife Fund (WWF), for instance,
employs data scientists to increase the effectiveness of their fundraising efforts. Many data
scientists devote part of their time to helping NGOs, because NGOs often lack the resources
to collect data and employ data scientists. DataKind is one such data scientist group that
devotes its time to the benefit of mankind.
4. Universities use data science in their research but also to enhance the study experience of
their students. The rise of massive open online courses (MOOCs) produces a lot of data, which
allows universities to study how this type of learning can complement traditional classes.
Facets of data
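Data comes in many different types, and each of them tends to require different tools and
techniques. The main categories of data are these:
Structured
Unstructured
Natural language
Machine-generated
Graph-based
Audio, video, and images
Streaming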
Structured data:
Structured data is data that depends on a data model and resides in a fixed field within
a re
files (figure 1.1). SQL, or Structured Query Language, is the preferred way to manage and
query data that resides in databases. You may also come across structured data that might
give you a hard time storing it in a traditional relational database. Hierarchical data such as a
imposed upon it by humans and machines. More often, data comes unstructured.
Unstructured data:
-
specific or varying. One example of unstructured data is your regular email (figure 1.2). Although
email contain
number of people who have written an email complaint about a specific employee because so many
ways exist to refer to a person, for example. The thousands of different languages and dialects out
there further complicate this. A human-written email, as shown in figure 1.2, is also a perfect example
of natural language data.
Natural language:
to process
because it requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition,
topic recognition, summarization, text completion, and sentiment analysis, but models trained
-of-the-
Machine-generated data:
Machine-
process, application, or other machine without human intervention. Machine-generated data
is becoming a major data resource and will continue to do so. Wikibon has forecast that the
market value of the industrial Internet (a term coined by Frost & Sullivan to refer to the
integration of complex physical machinery with networked sensors and software) will be
approximately $540 billion in 2020. IDC (International Data Corporation) has estimated there
will be 26 times more connected things than people in 2020. This network is commonly
referred to as the internet of things. The analysis of machine data relies on highly scalable
tools, due to its high volume and speed. Examples of machine data are web server logs, call
detail records, network event logs, and telemetry (figure 1.3).
Graph-based or network data:
Examples of graph-based data can be found on many social media websites (figure
1.4). For instance, on LinkedIn you can see who you know at which company. Your follower
list on Twitter is another example of graph-based data. The power and sophistication come
from multiple, overlapping graphs of the same nodes. For example, imagine the connecting
which connect business colleagues via LinkedIn. Imagine a third graph based on movie
interests on Netflix. Overlapping the three different-looking graphs makes more interesting
questions possible.
Audio, image, and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to
be challenging for computers. MLBAM (Major League Baseball Advanced Media)
purpose of live, in-game analytics. High-speed cameras at stadiums will capture ball and
athlete movements to calculate in real time, for example, the path taken by a defender relative
to two baselines.
capable of learning how to play video games. This algorithm takes the video screen as input
feat that prompted Google to buy the company for their own Artificial Intelligence (AI)
Streaming data:
While streaming data can take almost any of the previous forms, it has an extra
property. The data flows into the system when an event happens instead of being loaded into
such because you need to adapt your process to deal with this type of information. Examples
Data science is mostly applied in the context of an organization. When the business
that, what data and resources you need, a timetable, and deliverables.
2. Retrieving data:
need and where you can find it. In this step you ensure that you can use the data in your
program, which means checking that the data exists, is of sufficient quality, and is accessible to you. Data can
also be delivered by third-party companies and takes many forms ranging from Excel
spreadsheets to different types of databases.
3. Data preparation:
Data collection is an error-prone process; in this phase you enhance the quality of the
data and prepare it for use in subsequent steps. This phase consists of three subphases: data
cleansing removes false values from a data source and inconsistencies across data sources,
data integration enriches data sources by combining information from multiple data sources,
and data transformation ensures that the data is in a suitable format for use in your models.
4. Data exploration:
Data exploration is concerned with building a deeper understanding of your data. You
try to understand how variables interact with each other, the distribution of the data, and
whether there are outliers. To achieve this you mainly use descriptive statistics, visual
techniques, and simple modeling. This step often goes by the abbreviation EDA, for
Exploratory Data Analysis.
5. Model building:
In this phase you use models, domain knowledge, and insights about the data you
found in the previous steps to answer the research question. You select a technique from the
fields of statistics, machine learning, operations research, and so on. Building a model is an
iterative process that involves selecting the variables for the model, executing the model, and
model diagnostics.
6. Presentation and automation:
Finally, you present the results to your business. These results can take many forms,
execution of the process because the business will want to use the insights you gained in
another project or enable an operational process to use the outcome from your model.
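The three subphases of data preparation described in step 3 above can be sketched in a few lines. This is only a minimal illustration under assumed inputs: the records, the field names (customer_id, country, amount, region), and the correction table are all hypothetical, and real projects usually rely on dedicated libraries for this work.

```python
import math

# Hypothetical raw records from two sources; all names and values are made up.
sales = [
    {"customer_id": 1, "country": "USQ", "amount": 120.0},
    {"customer_id": 2, "country": "United Kingdom", "amount": -5.0},  # impossible value
    {"customer_id": 3, "country": "USA", "amount": 80.0},
]
regions = {1: "North", 3: "South"}  # second source, keyed by customer_id

COUNTRY_FIXES = {"USQ": "USA", "United Kingdom": "UK"}

def cleanse(rows):
    """Data cleansing: drop impossible amounts and standardize country codes."""
    cleaned = []
    for row in rows:
        if row["amount"] < 0:
            continue
        fixed = dict(row)
        fixed["country"] = COUNTRY_FIXES.get(row["country"], row["country"])
        cleaned.append(fixed)
    return cleaned

def integrate(rows, lookup):
    """Data integration: enrich each row with the region from the second source."""
    return [dict(row, region=lookup.get(row["customer_id"], "unknown")) for row in rows]

def transform(rows):
    """Data transformation: add a log-scaled amount for use in a model."""
    return [dict(row, log_amount=math.log(row["amount"])) for row in rows]

prepared = transform(integrate(cleanse(sales), regions))
```

Keeping each subphase in its own function makes it possible to rerun the same preparation later on new data.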
l dedicate a separate chapter to the most important data science technology classes.
The mind map in figure 1.6 shows the components of the big data ecosystem and
A distributed file system is similar to a normal file system, except that it runs on
They can store files larger than any one computer disk.
Files get automatically replicated across multiple servers for redundancy or parallel
operations while hiding the complexity of doing so from the user.
Once you have the data stored on the distributed file system, you want to exploit it.
Once you have a distributed file system in place, you need to add data. You need to
move data from one source to another, and this is where the data integration frameworks
such as Apache Sqoop and Apache Flume excel. The process is similar to an extract,
transform, and load process in a traditional data warehouse.
where you rely on the fields of machine learning, statistics, and applied mathematics. One of
data we need to analyze today, this becomes problematic, and specialized frameworks and
libraries are required to deal with this amount of data. The most popular machine-learning
library for Python is Scikit-learn. There are, of course, other Python libraries:
PyBrain for neural networks - Neural networks are learning algorithms that
mimic the human brain in learning mechanics and complexity. Neural networks
are often regarded as advanced and black box.
NLTK or Natural Language Toolkit - As the name suggests, its focus is working
with natural language. It comes bundled with a number of text corpuses to help
you model your own data.
Pylearn2 - Another machine learning toolbox, but a bit less mature than
Scikit-learn.
TensorFlow - A Python library for deep learning provided by Google.
5. NoSQL databases
managing and querying this data. Traditionally this has been the playing field of relational
databases such as Oracle SQL, MySQL, and Sybase. While they are still the go-to
technology for many use cases, new types of databases have emerged under the grouping of
NoSQL databases.
Many different types of databases have arisen, but they can be categorized into the
following types:
7. Benchmarking tools
This class of tools was developed to optimize your big data installation by providing
standardized profiling suites. A profiling suite is taken from a representative set of big data
jobs. Benchmarking and optimizing the big data infrastructure and configura
jobs for data scientists themselves but for a professional specialized in setting up IT
make a big cost difference. For example, if you can gain 10% on a cluster of 100 servers, you
save the cost of 10 servers.
8. System deployment
Setting up a big
deploying new applications into the big data cluster is where system deployment tools shine.
They largely automate the installation and configuration of big data compo
core task of a data scientist.
9. Service programming
-class soccer prediction application on Hadoop, and
you want to allow others to use the predictions made by your application. However, you have
no idea of the architecture or technology of everyone keen on using your predictions. Service
tools excel here by exposing big data applications to other applications as a service. Data
scientists sometimes need to expose their models through services. The best-known example
is the REST service; REST stands for representational state transfer. It is often used to feed
websites with data.
10. Security
Big data security tools allow you to have central and fine-grained control over access
to the data. Big data security has become a topic in its own right, and data scientists are
usually only confronted with it as data consumers; seldom will they implement the security
UNIT II
The Data science process: Overview - research goals - retrieving data - transformation -
Exploratory Data Analysis - Model building.
Structured approach to data science helps you to maximize your chances of success in
a data science project at the lowest cost. It also makes it possible to take up a project as a
team, with each team member focusing on what they do best. Take care, however: this
approach may not be suitable for every type of project or be the only way to do good data
science. The typical data science process consists of six steps through which you iterate, as
shown in figure 2.1.
Figure 2.1 summarizes the data science process and shows the main steps and actions
t.
1. The first step of this process is setting a research goal. The main purpose here is
making sure all the stakeholders understand the what, how, and why of the project. In
every serious project this will result in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so
this step includes finding suitable data and getting access to the data from the data
owner. The result is data in its raw form, which probably needs polishing and
transformation before it becomes usable.
3.
on visual and descriptive techniques. The insights you gain from this phase will
enable you to start modeling.
5. Model building
that you attempt to gain the insights or make the predictions stated in your project
charter. Now is the time to bring out the heavy guns, but remember research has
taught us that often (but not always) a combination of simple models tends to
done.
6. The last step of the data science model is presenting your results and automating
the analysis, if needed. One goal of a project is to change a process and/or make
better decisions. You may still need to convince the business that your findings will
indeed change the business process as expected. This is where you can shine in your
influencer role. The importance of this step is more apparent in projects on a strategic
and tactical level. Certain projects require you to perform the business process over
and over again, so automating the project will save time.
A project starts by understanding the what, the why, and the how of your project
(figure 2.2). What does the company expect you to do? And why does management place
originating from an opportunity someone detected? Answering these three questions (what,
why, how) is the goal of the first phase, so that everybody knows what to do and can agree on
the best course of action.
The outcome should be a clear research goal, a good understanding of the context,
well-defined deliverables, and a plan of action with a timetable. This information is then best
placed in a project charter. The length and formality can, of course, differ between projects
and companies. In this early phase of the project, people skills and business acumen are more
important than great technical prowess, which is why this part will often be guided by more
senior personnel.
An essential outcome is the research goal that states the purpose of your assignment in
a clear and focused manner. Understanding the business goals and context is critical for
project success. Continue asking questions and devising examples until you grasp the exact
business expectations, identify how your project fits in the bigger picture, appreciate how
l use your results.
Nothing is more frustrating than spending months researching something until you have that
one moment of brilliance and solve the problem, but when you report your findings back to
the organization, everyone immediately realizes that you misunderstood their question. Don't
skim over this phase lightly. Many data scientists fail here: despite their mathematical wit and
scientific brilliance, they never seem to grasp the business goals and context.
Create a project charter
A project charter requires teamwork, and your input covers at least the following:
Retrieving data
Sometimes you need to go into the field and design a data collection process yourself,
companies will have already
col
more organizations are making even high-quality data freely available for public and
commercial use.
Data can be stored in many forms, ranging from simple text files to tables in a
database. The objective now is acquiring all the data you need. This may be difficult, and even if
you succeed, data is often like a diamond in the rough: it needs polishing to be of any use to
you.
available within your company. Most companies have a program for maintaining key data; so
much of the cleaning work may already be done. This data can be stored in official data
repositories such as databases, data marts, data warehouses, and data lakes maintained by a
team of IT professionals. The primary goal of a database is data storage, while a data
warehouse is designed for reading and analyzing that data. A data mart is a subset of the data
warehouse and geared toward serving a specific business unit. While data warehouses and
data marts are home to preprocessed data, a data lake contains data in its natural or raw
format. But the possibility exists that your data still resides in Excel files on the desktop of a
domain expert.
Finding data even within your own company can sometimes be a challenge. As
companies grow, their data becomes scattered around many places. Knowledge of the data
may be dispersed as people change positions and leave the company. Documentation and
need to
develop some Sherlock Holmes-like skills to find all the lost bits.
Getting access to data is another difficult task. Organizations understand the value and
sensitivity of data and often have policies in place so everyone has access to what they need
and nothing more. These policies translate into physical and digital barriers called Chinese
walls. These walls are well-regulated for customer data in most countries.
This is for good reasons, too; imagine everybody in a credit card company having access to
your spending habits. Getting access to the data may take time and involve company politics.
Many companies specialize in collecting valuable information. For instance, Nielsen and
GFK are well known for this in the retail industry. Other companies provide data so that you,
in turn, can enrich their services and ecosystem. Such is the case with Twitter, LinkedIn, and
Facebook. Although data is considered an asset more valuable than oil by certain companies,
more and more governments and organizations share their data for free with the world. This
data can be of excellent quality; it depends on the institution that creates and manages it. The
information they share covers a broad range of topics such as the number of accidents or
amount of drug abuse in a certain region and its demographics. This data is helpful when you
want to enrich proprietary data but also convenient when training your data science skills at
home. Table 2.1 shows only a small selection from the growing number of open-data
providers.
Expect to spend a good portion of your project time doing data correction and
cleans
phase are easy to spot, but being too careless will make you spend many hours solving data
issues that could have been prevented during data import.
phases. The difference is in the goal and the depth of the investigation. During data retrieval,
you check to see if the data is equal to the data in the source document and look to see if you
the data is similar to the data you find in the source document, you stop. With data
preparation, you do a more elaborate check. If you did a good job during the previous phase,
the errors you find now are also present in the source document. The focus is on the content
of the variables: you want to get rid of typos and other data entry errors and bring the data to
a common standard among the data sets. For example, you might correct USQ to USA and
United Kingdom to UK. During the exploratory phase your focus shifts to what you can learn
from the data. Now you assume the data to be clean and look at the statistical properties such
as distribu
when you discover outliers in the exploratory phase, they can point to a data entry error. Now
TRANSFORMING DATA
and i
a suitable form for data modeling.
Take, for instance, a relationship of the form $y = ae^{bx}$. Taking the log of both sides gives
$\ln y = \ln a + bx$, which is linear in $x$ and simplifies the estimation problem dramatically. Figure 2.11 shows how
transforming the input variable greatly simplifies the estimation problem. Other times you
might want to combine two variables into a new variable.
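As a minimal sketch of why such a transformation helps, the example below generates noise-free data from an exponential relationship and recovers its parameters with an ordinary straight-line fit on the logarithm of y. The values of a_true and b_true and the helper fit_line are illustrative assumptions, not part of the original material.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic, noise-free data from y = a * exp(b * x); a and b are illustrative.
a_true, b_true = 2.0, 0.5
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [a_true * math.exp(b_true * x) for x in xs]

# ln(y) = ln(a) + b * x, so a straight-line fit on (x, ln y) recovers both parameters.
intercept, slope = fit_line(xs, [math.log(y) for y in ys])
a_est, b_est = math.exp(intercept), slope
```

With real, noisy data the recovered values would only approximate the true parameters.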
REDUCING THE NUMBER OF VARIABLES
Sometimes you have too many variables and need to reduce the number because they
Data scientists use special methods to reduce the number of variables but retain the
these methods in chapter 3. Figure 2.12
shows how reducing the number of variables makes it easier to understand the key values. It
also shows how two variables account for 50.6% of the variation within the data set.
Variables can be turned into dummy variables (figure 2.13). Dummy variables can only take two values: true (1) or false (0).
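A dummy variable encodes the presence or absence of one category as 1 or 0. The sketch below shows one way to build such columns by hand; the function name to_dummies and the weekday data are hypothetical and chosen only for illustration.

```python
def to_dummies(values):
    """Turn one categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

weekdays = ["Mon", "Tue", "Mon", "Wed"]  # illustrative data
dummies = to_dummies(weekdays)
# dummies["Mon"] marks the observations that fall on a Monday
```

Each observation has exactly one dummy column set to 1, which is what lets a model treat the categories separately.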
In this section we introduced the third step in the data science process (cleaning,
transforming, and integrating data), which changes your raw data into usable input for the
modeling phase. The next step in the data science process is to get a better understanding of
the content of the data and the relationships between the variables and observations; we
explore this in the next section.
During exploratory data analysis you take a deep dive into the data (see figure 2.14).
Information becomes much easier to grasp when shown in a picture, therefore you mainly use
graphical techniques to gain an understanding of your data and the interactions between
variables. This phase is about exploring data, so keeping your mind open and your eyes
peeled is essential during the exploratory data analysis phase.
The visualization techniques you use in this phase range from simple line graphs or
histograms, as shown in figure 2.15, to more complex diagrams such as Sankey and network
more insight into the data. Other times the graphs can be animated or made interactive to
Figure 2.15: From top to bottom, a bar chart, a line plot, and a distribution are some of the graphs used in exploratory analysis.
These plots can be combined to provide even more insight, as shown in figure 2.16.
Figure 2.16: Drawing multiple plots together can help you understand the structure of your data over multiple variables.
Two other important graphs are the histogram shown in figure 2.19 and the boxplot
shown in figure 2.20.
In a histogram a variable is cut into discrete categories and the number of occurrences
in each category are summed up and shown in the graph. The boxplot, on the other hand,
distribution within categories. It can show the maximum, minimum, median, and other
characterizing measures at the same time.
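The numbers behind both graphs can be computed without any plotting library. The following sketch, using only Python's standard library, bins a variable the way a histogram does and computes the summary values a boxplot draws. The ages list and both function names are illustrative assumptions, and the quartiles depend on the interpolation method used by statistics.quantiles.

```python
import statistics

def histogram_counts(values, bin_width):
    """Cut a variable into bins of equal width and count occurrences per bin."""
    counts = {}
    for v in values:
        bin_start = (v // bin_width) * bin_width
        counts[bin_start] = counts.get(bin_start, 0) + 1
    return dict(sorted(counts.items()))

def boxplot_summary(values):
    """Minimum, quartiles, median, and maximum, as drawn by a boxplot."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}

ages = [21, 22, 25, 31, 34, 35, 38, 42, 47, 63]  # illustrative data
counts = histogram_counts(ages, bin_width=10)
summary = boxplot_summary(ages)
```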
Build the models
build models with the goal of making better predictions, classifying objects, or gaining an
mining, and/or statistics. In this chapter we only explore the tip of the iceberg of existing
techniques, while chapt
techniques will help you in 80% of the cases because techniques overlap in what they try to
accomplish. They often achieve their goals in similar but slightly different ways.
Building a model is an iterative process. The way you build your model depends on
whether you go with classic statistics or the somewhat more recent machine learning school,
and the type of technique you want to use. Either way, most models consist of the following
main steps:
technique. Your findings from the exploratory analysis should already give a fair idea of what
variables will help you construct a good model. Many modeling techniques are available, and
consider model performance and whether your project meets all the requirements to use your
model, as well as other factors:
Must the model be moved to a production environment and, if so, would it be easy to
implement?
How difficult is the maintenance on the model: how long will it remain relevant if left
untouched?
Does the model need to be easy to explain?
Model execution
Once you have chosen a model, you'll need to implement it in code.
Most programming languages, such as Python, already have libraries such as StatsModels or
Scikit-learn. These packages use several of the most popular techniques. Coding a model is a
nontrivial task in most cases, so having these libraries available can speed up the process. As
StatsModels or Scikit-learn. Doing this yourself would require much more effort even for the
simple techniques. The following listing shows the execution of a linear prediction model.
We, however, created the target variable, based on the predictor by adding a bit of
-fitting model. The
[Link]() outputs the table in figure 2.23. Mind you, the exact outcome depends on
the random variables you got.
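The idea of fitting a linear model to a target that was built from the predictor plus noise can be sketched without any external library. The example below is not the StatsModels listing the text refers to; it swaps in a plain least-squares fit with one predictor, and the coefficient 0.6, the noise scale, and the seed are assumptions chosen for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Synthetic data: the target is built from the predictor plus a bit of random noise.
predictor = [random.random() for _ in range(500)]
target = [0.6 * x + 0.1 * random.random() for x in predictor]

n = len(predictor)
mean_x = sum(predictor) / n
mean_y = sum(target) / n

# Least-squares estimates for target = intercept + slope * predictor.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, target))
         / sum((x - mean_x) ** 2 for x in predictor))
intercept = mean_y - slope * mean_x

# R-squared: the share of the variation in the target captured by the model.
fitted = [intercept + slope * x for x in predictor]
ss_res = sum((y - f) ** 2 for y, f in zip(target, fitted))
ss_tot = sum((y - mean_y) ** 2 for y in target)
r_squared = 1 - ss_res / ss_tot
```

Because the target was constructed from the predictor, the fit is expected to be very good; with a different seed the exact numbers change slightly.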
Model fit - For this the R-squared or adjusted R-squared is used. This measure is an
indication of the amount of variation in the data that gets captured by the model. The
difference between the adjusted R-squared and the R-squared is minimal here because the
adjusted one is the normal one + a penalty for model complexity. A model gets complex
simple model is available, so the adjusted R-squared punishes you for overcomplicating. At
any rate, 0.893 is high, and it should be because we cheated. Rules of thumb exist, but for
models in businesses, models above 0.85 are often considered good. If you want to win a
competition you need to be in the high 90s. For research, however, very low model fits (<0.2
even) are often found. What is more important there is the influence of the introduced predictor
variables.
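The penalty for model complexity can be made concrete with the commonly used formula for adjusted R-squared, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. In the sketch below the value 0.893 comes from the text, while the observation and predictor counts are assumptions for illustration.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    penalty = (n_observations - 1) / (n_observations - n_predictors - 1)
    return 1 - (1 - r_squared) * penalty

plain = 0.893
# Hypothetical model sizes: few predictors versus many predictors.
adjusted_small_model = adjusted_r_squared(plain, n_observations=500, n_predictors=2)
adjusted_large_model = adjusted_r_squared(plain, n_observations=500, n_predictors=50)
```

With only two predictors the adjustment is tiny, which matches the remark that the difference between the two measures is minimal for a simple model.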
Predictor variables have a coefficient - For a linear model this is easy to interpret.
finding a good predictor can be your route to a Nobel Prize even though your model as a
whole is rubbish. If, for instance, you determine that a certain gene is significant as a cause
for cancer, this is important knowledge, even if that gene in itself doesn't determine whether
a person will get cancer. The example here is classification, not regression, but the point
remains the same: detecting influences is more important in scientific studies than perfectly
fitting models (not to mention more realistic). But when do we know a gene has that impact?
This is called significance.
Predictor significance - Coefficients are great, but sometimes not enough evidence
exists to show that the influence is there. This is what the p-value is about. A long
explanation about type 1 and type 2 mistakes is possible here, but the short explanation
would be: if the p-value is lower than 0.05, the variable is considered significant for most
redictor
Linear regression works if you want to predict a value, but what if you want to
classify something? Then you go to classification models, the best known among them being
k-nearest neighbors.
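A k-nearest neighbors classifier labels a new observation by looking at the labels of the k closest training observations. The following is a minimal sketch under simple assumptions: Euclidean distance, majority voting, and a tiny made-up data set with the labels "small" and "large".

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Illustrative training data: two well-separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["small", "small", "small", "large", "large", "large"]

prediction_near_origin = knn_predict(points, labels, (1.1, 1.0))
prediction_far_away = knn_predict(points, labels, (5.0, 5.05))
```

Library implementations such as the one in Scikit-learn add efficient neighbor search and more distance options.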
For one, the classifier had but three options; marking the difference with last time
n
multiple criteria. Working with a holdout sample helps you pick the best-performing model.
A holdout sample is a part of the data you leave out of the model building so it can be used to
evaluate the model afterward. The principle here is simple: the model should work on unseen
data. You use only a fraction of your data to estimate the model and the other part, the
holdout sample, is kept out of the equation. The model is then unleashed on the unseen data
and error measures are calculated to evaluate it. Multiple error measures are available, and in
figure 2.26 we show the general idea on comparing models. The error measure used in the
example is the mean square error.
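The holdout principle can be sketched as follows: estimate each candidate model on one part of the data, then compare the mean square error on the part that was kept aside. The data, the 80/20 split, and the two candidate models are assumptions made for illustration only.

```python
import random

random.seed(1)  # fixed seed so the split is repeatable

# Synthetic observations following y = 3x + noise (illustrative values only).
xs = [random.uniform(0, 10) for _ in range(200)]
data = [(x, 3 * x + random.gauss(0, 0.5)) for x in xs]

# Keep 20% of the observations out of model building as the holdout sample.
random.shuffle(data)
split = int(0.8 * len(data))
train, holdout = data[:split], data[split:]

def mean_squared_error(model, rows):
    """Average squared difference between actual and predicted values."""
    return sum((y - model(x)) ** 2 for x, y in rows) / len(rows)

# Candidate 1: always predict the mean of the training targets.
train_mean_y = sum(y for _, y in train) / len(train)

def mean_model(x):
    return train_mean_y

# Candidate 2: a least-squares line estimated on the training part only.
train_mean_x = sum(x for x, _ in train) / len(train)
slope = (sum((x - train_mean_x) * (y - train_mean_y) for x, y in train)
         / sum((x - train_mean_x) ** 2 for x, _ in train))
intercept = train_mean_y - slope * train_mean_x

def line_model(x):
    return intercept + slope * x

# Both models are evaluated on data they have never seen.
mse_mean = mean_squared_error(mean_model, holdout)
mse_line = mean_squared_error(line_model, holdout)
```

The model with the lower error on the holdout sample is the better candidate by this criterion.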
Many models make strong assumptions, such as independence of the inputs, and you
have to verify that these assumptions are indeed met. This is called model diagnostics.
This section gave a short introduction to the steps required to build a valid model.
UNIT III
Algorithms: Machine learning algorithms - Modeling process - Types - Supervised -
Unsupervised - Semi-supervised.
Key Characteristics
With engineering features, you must come up with and create possible predictors for
the model. This is one of the most important steps in the process because a model recombines
these features to achieve its predictions. Often you may need to consult an expert or the
appropriate literature to come up with meaningful features.
Certain features are the variables you get from a data set, as is the case with the
pro
find the features yourself, which may be scattered among different data sets. In several
projects we had to bring together more than 20 different data sources before we had the raw
data we required.
With the right predictors in place and a modeling technique in mind, you can progress
to model training. In this phase you present to your model data from which it can learn.
model validation.
Validating a model
Data science has many modeling techniques, and the question is which one is the right
one to use. A good model has two properties: it has good predictive power and it generalizes
Two common error measures in machine learning are the classification error rate for
classification problems and the mean squared error for regression problems. The
classification error rate is the percentage of observations in the test data set that your model
mislabeled; lower is better. The mean squared error measures how big the average error of
wrong prediction in one direction with a faulty prediction in the other direction. For example,
overestimating future turnover
by 5,000 for the following month. As a second consequence of squaring, bigger errors get
even more weight than they otherwise would. Small errors remain small or can even shrink
(if the error is smaller than 1), whereas big errors are enlarged and will definitely draw your attention.
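Both error measures, and the two effects of squaring, can be illustrated with a short sketch. The turnover figures and labels below are made up; the point is that errors of opposite sign cancel in a plain average but not after squaring.

```python
def classification_error_rate(actual, predicted):
    """Share of observations the model mislabeled; lower is better."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

def mean_error(actual, predicted):
    """Plain average error: opposite errors cancel each other out."""
    return sum(p - a for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    """Average squared error: no cancelling, and big errors weigh more."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Overestimating turnover by 5,000 one month and underestimating by 5,000 the next.
actual_turnover = [100_000, 100_000]
predicted_turnover = [105_000, 95_000]

cancelled = mean_error(actual_turnover, predicted_turnover)
squared = mean_squared_error(actual_turnover, predicted_turnover)

error_rate = classification_error_rate(
    ["spam", "safe", "spam", "safe"],
    ["spam", "spam", "spam", "safe"],
)
```

An error of 0.5 squares to 0.25, while an error of 5,000 squares to 25,000,000, which shows how squaring shrinks small errors and enlarges big ones.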
Dividing your data into a training set with X% of the observations and keeping the
rest as a holdout data set (a data set that is never used for model creation): This is the
most common technique.
K-folds cross validation: This strategy divides the data set into k parts and uses each
part one time as a test data set while using the others as a training data set. This has
the advantage that you use all the data available in the data set.
Leave-1 out: This approach is the same as k-folds but with k equal to the number of
observations. You always leave one observation out and train on the rest of the data.
This is used only on small data sets, so it is more valuable to people evaluating
laboratory experiments than to big data analysts.
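The splitting strategies above differ only in how observations are assigned to training and test sets. The sketch below generates the index sets for k-folds cross validation; leaving one observation out is the special case where k equals the number of observations. The function name k_fold_indices is an illustrative choice.

```python
def k_fold_indices(n_observations, k):
    """Yield (train_indices, test_indices) so each observation is tested exactly once."""
    indices = list(range(n_observations))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_observations // k + (1 if i < n_observations % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, k=5))           # five folds of two observations
leave_one_out = list(k_fold_indices(10, k=10))  # one observation per test set
```

In practice the observations are usually shuffled before splitting so that the folds do not follow the original ordering of the data.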
model that generalizes to unseen data. The process of applying your model to new data is
called model scoring. In fact, model scoring is something you implicitly did during
Model scoring involves two steps. First, you prepare a data set that has features
exactly as defined by your model. This boils down to repeating the data preparation you did
in step one of the modeling process but for a new data set. Then you apply the model on this
new data set, and this results in a prediction.
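The two steps of model scoring can be sketched as a preparation function followed by the model itself. Everything here is hypothetical: the feature names, the raw input format, and the coefficients of the stand-in model are invented to show the flow, not taken from a trained model.

```python
def prepare_features(raw):
    """Hypothetical preparation step; it must match what was done before training."""
    return {
        "size_m2": float(raw["size_m2"]),
        "has_garden": 1 if raw["garden"] == "yes" else 0,
    }

def model(features):
    """Stand-in for a trained model; the coefficients are made up for illustration."""
    return 50_000 + 2_000 * features["size_m2"] + 15_000 * features["has_garden"]

# New, unseen observations arrive in raw form.
new_observations = [
    {"size_m2": "80", "garden": "yes"},
    {"size_m2": "55", "garden": "no"},
]

# Step 1: prepare the features. Step 2: apply the model to get predictions.
predictions = [model(prepare_features(raw)) for raw in new_observations]
```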
We can divide the different approaches to machine learning by the amount of human
effort required to coordinate them and how they use labeled data, that is, data with a category
or a real-value number assigned to it that represents the outcome of previous observations.
Supervised learning techniques attempt to discern results and learn by trying to find
patterns in a labeled data set. Human interaction is required to label the data.
Unsupervised learning techniques don't rely on labeled data and attempt to find
patterns in a data set without human interaction.
Semi-supervised learning techniques need labeled data, and therefore human
interaction, to find patterns in the data set, but they can still progress toward a result
and learn even if passed unlabeled data as well.
Supervised learning
Supervised learning is a learning technique that can only be applied on labeled data.
One of the many common approaches on the web to stopping computers from hacking
into user accounts is the Captcha check: a picture of text and numbers that the human user
must decipher and enter into a form field before sending the form back to the web server.
Something like figure 3.3 should look familiar.
With the help of the Naïve Bayes classifier, a simple yet powerful algorithm to categorize
observations into classes that is explained in more detail in the sidebar, you can recognize
digits from images. These images are not unlike the Captcha checks that
many websites have in place.
Step 1. Our research goal is to let a computer recognize images of numbers.
Not every email you receive has honest intentions. Your inbox can contain unsolicited
commercial or bulk email, better known as spam. Spam is not only annoying; it is often used in scams
and as a carrier for viruses. Kaspersky estimates that more than 60% of the emails in the
world are spam. To protect users from spam, most email clients run a program in the
background that classifies emails as either spam or safe.
A popular technique in spam filtering is employing a classifier that uses the words inside the
mail as predictors. It outputs the probability that a specific email is spam given the words it is
composed of (in mathematical terms, P(spam | words)). To reach this conclusion it uses three
calculations:
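By Bayes' rule, P(spam | words) = P(spam) x P(words | spam) / P(words), so a classifier of this kind needs the prior P(spam), the likelihood P(words | spam), and the normalizer P(words). The sketch below shows one way these quantities can be combined. It is an illustration under stated assumptions: the four training mails are invented, words are treated as independent (the "naive" assumption), and add-one smoothing is used so that unseen words do not produce a zero probability.

```python
import math
from collections import Counter

# Hypothetical toy training data, invented for this sketch.
spam_mails = ["win money now", "cheap money offer"]
safe_mails = ["meeting schedule for monday", "project money report"]

def word_counts(mails):
    """Count how often each word occurs in a list of mails."""
    return Counter(word for mail in mails for word in mail.split())

spam_counts = word_counts(spam_mails)
safe_counts = word_counts(safe_mails)
vocabulary = set(spam_counts) | set(safe_counts)

def log_likelihood(words, counts):
    """log P(words | class), assuming independent words and add-one smoothing."""
    total = sum(counts.values())
    return sum(math.log((counts[w] + 1) / (total + len(vocabulary))) for w in words)

def p_spam_given_words(mail):
    """P(spam | words) via Bayes' rule; words outside the vocabulary are ignored."""
    words = [w for w in mail.split() if w in vocabulary]
    n_mails = len(spam_mails) + len(safe_mails)
    # Prior times likelihood for each class.
    spam_joint = math.exp(math.log(len(spam_mails) / n_mails)
                          + log_likelihood(words, spam_counts))
    safe_joint = math.exp(math.log(len(safe_mails) / n_mails)
                          + log_likelihood(words, safe_counts))
    # P(words) is the sum of both joint probabilities and acts as the normalizer.
    return spam_joint / (spam_joint + safe_joint)

spam_score = p_spam_given_words("win cheap money")
safe_score = p_spam_given_words("monday meeting report")
```

A mail would be flagged as spam when the score exceeds a chosen threshold such as 0.5.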
Step 2 of the data science process: fetching the digital image data
gray image, you put a value in every matrix entry that depicts the gray value to be shown.
The following code demonstrates this process and is step four of the data science process:
data exploration.
actual code output, but perhaps figure 3.5 can clarify this slightly, because it shows how each
more work to do. The Naïve Bayes classifier is expecting a list of values, but [Link]()
returns a two-dimensional array (a matrix) reflecting the shape of the image. To flatten it into
a list, we need to call reshape() on [Link]. The net result will be a one-dimensional
array that looks something like this:
array([[ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10., 15., 5., 0., 0., 3., 15., 2., 0., 11., 8., 0.,
0., 4., 12., 0., 0., 8., 8., 0., 0., 5., 8., 0., 0., 9., 8., 0., 0., 4., 11., 0., 1., 12., 7., 0., 0., 2., 14., 5.,
10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.]])
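The flattening step can be illustrated without any library. The sketch below uses only the first two rows of the gray values printed above and a plain list comprehension; it stands in for the reshape call described in the text and is not the original listing.

```python
# First two rows of the digit image shown above (one gray value per pixel).
image_rows = [
    [0.0, 0.0, 5.0, 13.0, 9.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 13.0, 15.0, 10.0, 15.0, 5.0, 0.0],
]

# Flatten row by row so the classifier receives one list of values per image.
flat_pixels = [value for row in image_rows for value in row]
```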
The end result of this code is called a confusion matrix, such as the one shown in figure 3.6.
Returned as a two-dimensional array, it shows on the main diagonal how often the predicted
number was the correct number, and in the matrix entry (i,j) how often j was predicted
while the image showed i. Looking at figure 3.6 we can see that the model predicted the number
2 correctly 17 times (at coordinates 3,3), but also that the model predicted the number 8
fifteen times when it was actually the number 2 in the image (at 9,3).
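How such a matrix is built can be sketched as follows. The labels are invented for illustration and are not the results from figure 3.6; entry (i, j) counts how often class j was predicted while the true class was i.

```python
def confusion_matrix(actual, predicted, n_classes):
    """Return a matrix where entry [i][j] counts true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(actual, predicted):
        matrix[true_label][predicted_label] += 1
    return matrix

# Hypothetical labels for eight observations and three classes.
actual = [0, 1, 2, 2, 2, 1, 0, 2]
predicted = [0, 1, 2, 1, 2, 1, 0, 0]

matrix = confusion_matrix(actual, predicted, n_classes=3)

# Correct predictions sit on the main diagonal.
accuracy = sum(matrix[i][i] for i in range(3)) / len(actual)
```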
Unsupervised learning
Instead, we must take the approach that will work with this data because
We can study the distribution of the data and infer truths about the data in different
parts of the distribution.
We can study the structure and values in the data and infer new, more meaningful data
and structure from it.
Many techniques exist for each of these unsupervised learning approaches. However, in
practice you are always working toward the research goal defined in the first phase of the
data science process, so you may need to combine or try different techniques before either a
data set can be labeled, enabling supervised learning techniques, or the goal
itself is achieved.
Not everything can be measured. When you meet someone for the first time you
might try to guess whether they like you based on their behavior and how they respond. But
what if they are having a bad day, or are still
down from attending a funeral the week before? The point is that certain variables can be
immediately available while others can only be inferred and are therefore missing from your
data set. The first type of variable is known as an observable variable and the second type
is known as a latent variable. In our example, the emotional state of your new friend is a
latent variable. It definitely influences their judgment of you, but its value is not directly observable.
Deriving or inferring latent variables and their values based on the actual contents of a
data set is a valuable skill to have because
Latent variables can substitute for several existing variables already in the data set.
By reducing the number of variables in the data set, the data set becomes more
manageable, any further algorithms run on it work faster and predictions may become
more accurate.
Because latent variables are designed or targeted toward the defined research goal,
you lose little key information by using them.
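In this short case study we use Principal Component Analysis (PCA) to find latent variables
in a wine quality data set and then
compare how well a set of latent variables works in predicting the quality of wine against the
original observable variables.
Data set-The University of California, Irvine (UCI) has an online repository of 325
data sets. We use the
Wine Quality Data Set for red wines created by P. Cortez, A. Cerdeira, F. Almeida, and
colleagues. It is about 1,600 lines long and has 11 variables per line, as shown in
table 3.2.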
Principal Component Analysis-A technique to find the latent variables in your data
set while retaining as much information as possible.
Scikit-learn-We use this library because it already implements PCA for us and is a
way to generate the scree plot.
With the initial data preparation behind you, you can execute the PCA. The resulting scree
plot (which will be explained shortly) is shown in figure 3.8. Because PCA is an explorative
technique, we now arrive at step four of the data science process: data exploration, as shown
in the following listing.
The plot generated from the wine data set is shown in figure 3.8. What you hope to
see is an elbow or hockey stick shape in the plot. This indicates that a few variables can
represent the majority of the information in the data set while the rest only add a little more.
In our plot, PCA tells us that reducing the set down to one variable can capture approximately
28% of the total information in the set (the plot is zero-based, so variable
one is at position zero on the x axis), two variables will capture approximately 17% more or
45% total, and so on. Table 3.3 shows you the full read-out.
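Reading a scree plot comes down to adding up the share of information each component contributes. The sketch below shows that arithmetic. Only the first two ratios (28% and 17%) come from the text; the remaining values are hypothetical placeholders chosen so the list sums to 1, and they are not the figures from table 3.3.

```python
from itertools import accumulate

# 0.28 and 0.17 are taken from the text; all other ratios are hypothetical placeholders.
explained_variance_ratio = [0.28, 0.17, 0.14, 0.11, 0.09, 0.06,
                            0.05, 0.04, 0.03, 0.02, 0.01]

# Running total: how much information the first n components capture together.
cumulative = list(accumulate(explained_variance_ratio))

def components_needed(ratios, threshold):
    """Smallest number of components whose cumulative share reaches the threshold."""
    for count, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold - 1e-9:  # small tolerance for floating-point rounding
            return count
    return len(ratios)

needed_for_75_percent = components_needed(explained_variance_ratio, 0.75)
```

With these placeholder values, two components capture 45% of the information and five are needed to pass 75%.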
An elbow shape in the plot suggests that five variables can hold most of the
information found inside the data. You could argue for a cut-off at six or seven variables
instead, but we opt for a simpler data set over one that preserves more of the variance
of the original data set.
At this point, we could go ahead and see if the original data set recoded with five
latent variables is good enough to predict the quality of the wine accurately, but before we do,
we look at what the latent variables might represent.
With the initial decision made to reduce the data set from 11 original variables to 5
latent variables, we can check whether it is possible to interpret or name them based on
their relationships with the originals. Actual names are easier to work with than codes such as
lv1, lv2, and so on. We can add the line of code in the following listing to generate a table
that shows how the two sets of variables correlate.
The rows in the resulting table (table 3.4) show the mathematical correlation. Or, in English,
the first latent variable lv1, which captures approximately 28% of the total information in the
set, has the following formula.
Giving a useable name to each new variable is a bit trickier and would probably require
We can now recode the original data set with only the five latent variables. Doing this is data
preparation again, so we revisit step three of the data science process: data preparation. As
mentioned in chapter 2, the data science process is a recursive one, and this is especially true
between step three (data preparation) and step four (data exploration).
Table 3.6 shows the first three rows with this done.
Comparing the accuracy of the original data set with latent variables:
predictive performance improves. The following listing shows how this is done.
Grouping similar observations to gain insight from the distribution of your data:
In clustering, we attempt to divide our data set into observation subsets, or clusters, wherein observations
should be similar to those in the same cluster but differ greatly from the observations in other
clusters. Figure 3.10 gives you a visual idea of what clustering aims to achieve. The circles in
the top left of the figure are clearly close to each other while being farther away from the
others. The same is true of the crosses in the top right.
k-means is a good general-purpose algorithm with which to get started. However, like
many clustering algorithms, it requires you to specify the number of desired clusters in advance,
which necessarily results in a process of trial and error before reaching a decent conclusion. It
also presupposes that all the data required for analysis is available already.
The k-means algorithm starts from random initial
values, so you can end up with a different clustering every time you run the algorithm unless you
manually define the start values by specifying a seed (a constant for the start value generator).
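The role of the seed can be seen in a small plain-Python version of k-means. This is a simplified sketch, not the library implementation used in the text book: the function name, the fixed number of iterations, and the six data points are assumptions made for illustration.

```python
import random

def k_means(points, k, seed, n_iterations=20):
    """Cluster points into k groups; a fixed seed makes the start values reproducible."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k random observations serve as start values
    assignment = []
    for _ in range(n_iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids

# Hypothetical observations forming two clearly separated groups.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (10.0, 10.0), (11.0, 11.0), (12.0, 12.0)]

first_run, _ = k_means(points, k=2, seed=42)
second_run, _ = k_means(points, k=2, seed=42)
```

Both runs return the same assignment because the seed is identical; omitting the seed would let the cluster numbering change between runs.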
Although we would like all our data to be labeled so that we
can use the more powerful supervised machine learning techniques, in reality we often start
with only minimally labeled data, if it is labeled at all. We can use our unsupervised machine
learning techniques to analyze what we have and perhaps add labels to the data set, but it will
be prohibitively costly to label it all. Our goal then is to train our predictor models with as
little labeled data as possible. This is where semi-supervised learning techniques come in.
Take for example the plot in figure 3.12. In this case, the data has only two labeled
observations; normally this is too few to make valid predictions.
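One simple way to spread two labels over unlabeled observations is to repeatedly give the unlabeled point that lies closest to an already labeled point the same label. The sketch below illustrates that idea only. It is a nearest-neighbor propagation written for this note, not the technique or listing from the text book, and the data points are invented.

```python
def propagate_labels(points, labels):
    """Fill in None labels by repeatedly copying the label of the nearest labeled point."""
    labels = list(labels)
    if all(label is None for label in labels):
        raise ValueError("at least one observation must be labeled")
    while None in labels:
        best = None  # (squared distance, index of unlabeled point, label to copy)
        for i, label_i in enumerate(labels):
            if label_i is not None:
                continue
            for j, label_j in enumerate(labels):
                if label_j is None:
                    continue
                distance = sum((p - q) ** 2 for p, q in zip(points[i], points[j]))
                if best is None or distance < best[0]:
                    best = (distance, i, label_j)
        labels[best[1]] = best[2]
    return labels

# Hypothetical data: six observations, only the first and the last carry a label.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
partial_labels = ["a", None, None, None, None, "b"]

full_labels = propagate_labels(points, partial_labels)
```

The approach works here because the two groups are well separated; with overlapping groups a single wrong early label would be copied onward.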
The core framework is composed of a distributed file system, a resource manager, and
a system to run distributed programs. In practice it allows you to work with the distributed
file system almost as easily as with the local file system of your home computer. But in the
background, the data can be scattered among thousands of servers.
On top of that, an ecosystem of applications arose (figure 5.2), such as the database Hive,
which is used in this chapter. Hive has a language based on the widely used SQL to interact with data stored inside
the database.
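MapReduce: How Hadoop Achieves Parallelism
Hadoop uses a programming method called MapReduce to achieve parallelism. MapReduce is not well
suited for interactive analysis or iterative programs because it writes the data to a disk in
between each computational step. This is expensive when working with large data sets.
As a simplified example of how MapReduce works, suppose you are the
director of a toy company. Every toy has two colors, and when a client orders a toy from the
web page, the web page puts an order file on Hadoop with the colors of the toy. Your task is
to count how often each color is ordered, using a MapReduce-style algorithm.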
As the name suggests, the process roughly boils down to two big phases:
Mapping phase - The documents are split up into key-value pairs. Until we reduce,
we can have many duplicates.
Reduce phase - The occurrences of each unique key
are grouped together, and depending on the reducing function, a different result is
returned.
The whole process is described in the following six steps and depicted in figure 5.4.
1. Reading the input files.
2. Passing each line to a mapper job.
3. The mapper job parses the colors (keys) out of the file and outputs a file for each
color with the number of times it has been encountered (value). Or more
technically said, it maps a key (the color) to a value (the number of occurrences).
4. The keys get shuffled and sorted to facilitate the aggregation.
5. The reduce phase sums the number of occurrences per color and outputs one file
per key with the total number of occurrences for each color.
6. The keys are collected in an output file.
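The six steps above can be imitated on a single machine. The following is a minimal sketch that only mirrors the flow of data; it does not use Hadoop, writes no files, and the order lines and function names are assumptions made for illustration.

```python
from collections import defaultdict

# Step 1: hypothetical input, one line per order with the two colors of the toy.
order_lines = ["green,blue", "blue,orange", "green,red", "blue,green"]

def mapper(line):
    """Steps 2 and 3: parse the colors (keys) and emit a count of 1 for each (value)."""
    return [(color.strip(), 1) for color in line.split(",")]

def shuffle_and_sort(pairs):
    """Step 4: group all values that belong to the same key, sorted by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reducer(key, values):
    """Step 5: sum the number of occurrences for one color."""
    return key, sum(values)

mapped = [pair for line in order_lines for pair in mapper(line)]
grouped = shuffle_and_sort(mapped)

# Step 6: collect the reduced keys in one result.
color_counts = dict(reducer(key, values) for key, values in grouped.items())
```

In a real cluster the mapper and reducer calls run in parallel on different nodes, which is where the speed-up comes from.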
Data scientists often do interactive analysis and rely on algorithms that are inherently
iterative; it can take a while until an algorithm converges to a solution. As this is a weak point
of the MapReduce framework, we introduce the Spark Framework to overcome it. Spark
improves the performance on such tasks by an order of magnitude.
What Is Spark?
Spark is a cluster computing framework that does not itself handle file storage or
resource management. For this it relies on systems such as the Hadoop File System, YARN,
or Apache Mesos. Hadoop and Spark are thus complementary systems. For testing and
development, you can even run Spark on your local system.
How Does Spark Solve The Problems Of MapReduce?
While we oversimplify things a bit for the sake of clarity, Spark creates a kind of
shared RAM between the computers of your cluster. This allows the different
workers to share variables (and their state) and thus eliminates the need to write the
intermediate results to disk. More technically, Spark
uses Resilient Distributed Datasets (RDD), which are a distributed memory abstraction that
lets programmers perform in-memory computations on large clusters in a fault-tolerant way.
Because it is an in-memory system, it avoids costly disk operations.
Spark core provides a NoSQL environment well suited for interactive, exploratory
analysis. Spark can be run in batch and interactive mode and supports Python.
Spark has four other large components, as listed below and depicted in figure 5.5.
1. Spark Streaming
2. Spark SQL
3. MLlib
4. GraphX
The goal of NoSQL databases is not only to offer a way to partition
databases successfully over multiple nodes, but also to present fundamentally different ways
to model the data at hand to fit its structure to its use case and not to how a relational database
requires it to be modeled.
We start with the core ACID
principles of single-server relational databases and show how NoSQL databases rewrite them
into BASE principles. We then look at the
CAP theorem, which describes the main problem with distributing databases across multiple
nodes and how ACID and BASE databases approach it.
Atomicity-The "all or nothing" principle. If a record is put into a database, it is
put in completely or not at all. If, for instance, a power failure occurs in the
middle of a database write action, you would not end up with half a record.
Consistency-This important principle maintains the integrity of the data. No
entry that makes it into the database will ever be in conflict with predefined rules,
such as lacking a required field or a field being numeric instead of text.
Isolation-When something is changed in the database, nothing can happen on this
exact same data at exactly the same moment. Instead, the actions happen in serial
with other changes. Isolation is a scale going from low isolation to high isolation.
On this scale, traditional relational databases sit at the high-isolation end.
Durability-If data has entered the database, it should survive permanently.
Physical damage to the hard disks will destroy records, but power outages and
software crashes should not.
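Atomicity can be observed with the SQLite database that ships with Python. The sketch below is illustrative: the accounts table, the names, and the simulated crash are assumptions, and an in-memory database is used so nothing is written to disk.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
connection.executemany("INSERT INTO accounts VALUES (?, ?)",
                       [("alice", 100), ("bob", 50)])
connection.commit()

def transfer(conn, sender, receiver, amount, fail_midway=False):
    """Move money between two accounts as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, sender))
            if fail_midway:
                raise RuntimeError("simulated crash between the two updates")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, receiver))
    except RuntimeError:
        pass  # the half-finished transfer has already been rolled back

def balances(conn):
    """Return the current balances as a dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

transfer(connection, "alice", "bob", 30, fail_midway=True)
after_failure = balances(connection)

transfer(connection, "alice", "bob", 30)
after_success = balances(connection)
```

After the failed transfer neither balance has changed, so the database never shows money that was withdrawn but not deposited.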
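Once a database is spread out over different servers, it can be difficult to adhere to the
ACID principle because of the consistency ACID promises. The CAP Theorem states that a
database can be any two of the following things but never all three:
Partition tolerant-The database can handle a network partition or network
failure.
Available-As long as the node you are connecting to is up and running and you
can connect to it, the node will respond, even if the connection between the
different database nodes is lost.
Consistent-No matter which node you connect to, you will always see the exact
same data.
For a single-node database it is easy to see how it is always available and consistent:
Available-As long as the node is up, it is available. That is
all the CAP availability promises.
Consistent-There is no second node, so nothing can be inconsistent.
Things get interesting once the database gets partitioned. Then you need
to make a choice between availability and consistency, as shown in
figure 6.2.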
NoSQL databases that do not follow ACID,
such as the document stores and key-value stores, follow BASE. BASE is a set of much
softer database promises:
Eventual consistency-The database will become consistent over time. In the web
shop example, the table is sold twice, which results in data inconsistency. Once
the connection between the individual nodes is reestablished, they will
communicate and decide how to resolve it.
NoSQL database types:
1. Key-value store
2. Document store
3. Column-oriented database
4. Graph database
A full-scale relational database can be made up of many entities and linking tables. We now
look at the different NoSQL database types.
1. Key-Value Stores:
Key-value stores are the least complex of the NoSQL database types: they are, as the
name suggests, a collection of key-value pairs.
2. Document Stores:
Document stores are one step up in complexity from key-value stores: a document
store does assume a certain document structure that can be specified with a schema.
Document stores are
designed to store everyday documents as is, and they allow for complex querying and
calculations on this often already aggregated form of data.
3. Column-Oriented Database
Traditional relational databases are row-oriented, with each row having a row id and
each field within the row stored together in a table. Suppose that no
extra data about hobbies is stored and you have only a single table to describe people, as
shown in figure 6.8. Notice how in this scenario you have slight denormalization because
hobbies could be repeated. If the hobby information is a nice extra but not essential to your
use case, adding it as a list within the Hobbies column is an acceptable approach.
Every time you look up something in a row-oriented database, every row is scanned,
regardless of which columns you require. Suppose you only want a list of birthdays in
September. The database will scan the table from top to bottom and left to right, as shown in
figure 6.9, eventually returning the list of birthdays.
Indexing the data on certain columns can significantly improve lookup speed, but indexing
every column brings extra overhead and the database is still scanning all the columns.
Column databases store each column separately, allowing for quicker scans when only a
small number of columns is involved; see figure 6.10.
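The difference between the two layouts can be sketched with ordinary Python structures. This is a simplified in-memory illustration, not how a real database engine stores data, and the people in the table are invented.

```python
# Row-oriented layout: every record keeps all of its fields together.
rows = [
    {"row_id": 1, "name": "Anna", "birthday": "1985-09-11", "hobbies": ["archery"]},
    {"row_id": 2, "name": "Ben", "birthday": "1990-03-02", "hobbies": ["chess", "surfing"]},
    {"row_id": 3, "name": "Cara", "birthday": "1978-09-30", "hobbies": []},
]

# Column-oriented layout: every column is stored as its own list.
columns = {field: [row[field] for row in rows] for field in rows[0]}

def september_birthdays_row_store(table):
    """Touches every complete row, including fields the query does not need."""
    return [row["name"] for row in table if row["birthday"][5:7] == "09"]

def september_birthdays_column_store(store):
    """Reads only the name and birthday columns; the others are never visited."""
    return [name for name, birthday in zip(store["name"], store["birthday"])
            if birthday[5:7] == "09"]

from_rows = september_birthdays_row_store(rows)
from_columns = september_birthdays_column_store(columns)
```

Both queries return the same names; the column layout simply reads less data to get there.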
4. Graph Databases
The last big NoSQL database type is the most complex one, geared toward storing
relations between entities in an efficient manner. When the data is highly interconnected,
such as for social networks, scientific paper citations, or capital asset clusters, graph
databases are the answer. Graph or network data has two main components:
Node-The entities themselves. In a social network this could be people.
Edge-The relationship between two entities. This relationship is represented by a line
and has its own properties. An edge can have a direction, for example, if the arrow
indicates who is whose boss.
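The node and edge components described above can be represented with plain Python structures. This sketch is illustrative only: the people, the relation names, and the helper functions are assumptions, and a graph database such as Neo4j has its own storage model and query language.

```python
# Nodes: the entities themselves, each with its own properties.
nodes = {
    "ann": {"type": "person"},
    "bob": {"type": "person"},
    "carl": {"type": "person"},
}

# Edges: directed relationships (source, target, properties).
edges = [
    ("ann", "bob", {"relation": "boss_of"}),
    ("bob", "carl", {"relation": "boss_of"}),
    ("ann", "carl", {"relation": "friend_of"}),
]

def neighbors(node, relation):
    """Targets reachable from node over one edge with the given relation."""
    return [target for source, target, properties in edges
            if source == node and properties["relation"] == relation]

def all_subordinates(node):
    """Follow boss_of edges transitively to find everyone below a person."""
    found, to_visit = [], [node]
    while to_visit:
        for subordinate in neighbors(to_visit.pop(), "boss_of"):
            if subordinate not in found:
                found.append(subordinate)
                to_visit.append(subordinate)
    return found

direct_reports = neighbors("ann", "boss_of")
everyone_below_ann = all_subordinates("ann")
```

Following edges step by step like this is the kind of query that graph databases are built to answer efficiently.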
Graphs can become incredibly complex given enough relation and entity types. Figure
6.14 already shows that complexity with only a limited number of entities. Graph
databases like Neo4j also claim to uphold ACID, whereas document stores and key-value
stores adhere to BASE.
QUESTION BANK:
UNIT I
1. Data is mainly used to:
a) Confuse people b) Make better decisions
c) Increase mistakes d) Reduce accuracy
Answer: b
2. Raw data is best described as:
a) Organized facts b) Unprocessed information
c) Final results d) Cleaned data
Answer: b
3. The first step in the data science process is:
a) Modeling b) Deployment
c) Data collection d) Evaluation
Answer: c
4. Big data mainly deals with:
a) Small data b) Large and complex datasets
c) Only numbers d) Only images
Answer: b
5-MARK QUESTIONS
1. Explain any five benefits and uses of data in organizations.
2. What are the different types of data? Explain with examples.
3. Write a short note on the data science process.
UNIT III
1. Machine learning algorithms help systems to:
a) Sleep b) Learn from data
c) Draw cartoons d) Delete files
Answer: b
2. Supervised learning uses:
a) No data b) Only images
c) Labeled data d) Only videos
Answer: c
3. Unsupervised learning deals with:
a) Labeled data b) Unlabeled data
c) Audio files only d) No patterns
Answer: b
4. Semi-supervised learning uses:
a) Only unlabeled data
b) Only text
c) A small amount of labeled data + large unlabeled data
d) No data at all
Answer: c
5. The first step in the modeling process is:
a) Testing
b) Understanding the problem
c) Deployment
d) Deleting data
Answer: b
6. An example of a supervised learning algorithm is:
a) K-Means b) Linear Regression
c) Apriori d) PCA
Answer: b
7. An example of an unsupervised learning algorithm is:
a) Decision Tree b) Logistic Regression
c) K-Means Clustering d) Naive Bayes
Answer: c
8. Semi-supervised learning is useful when:
a) Labeled data is expensive b) Labeled data is unlimited
c) No data is available d) Models are not needed
Answer: a
9. The modeling process includes:
a) Training, testing, evaluation
b) Playing games
c) Sending emails
d) Copying folders
Answer: a
10. Supervised learning problems are mainly of two types:
a) Noise and error b) Classification and regression
c) Groups and clusters d) None
Answer: b
5-MARK QUESTIONS
1. Explain supervised learning with examples.
2. What is unsupervised learning? Describe any two common algorithms.
3. Write a short note on semi-supervised learning.
4. Explain the major steps in the modeling process.
5. List any five machine learning algorithms and explain their purpose.
6. Differentiate between supervised and unsupervised learning.
7. Explain the importance of labeled and unlabeled data in ML types.
8. Describe the role of training and testing in model building.
10-MARK QUESTIONS
1. Explain in detail the three types of machine learning: supervised, unsupervised, and semi-
supervised. Provide examples and applications for each.
2. Discuss the modeling process in detail. Explain data preparation, model selection, training,
validation, testing, and evaluation.
3. Explain supervised learning. Discuss classification and regression algorithms with real-
time examples.
4. Write a detailed note on unsupervised learning. Explain clustering, dimensionality
reduction, and association rules.
5. Discuss semi-supervised learning. Explain how it combines features of supervised and
unsupervised learning with examples.
6. Compare supervised, unsupervised, and semi-supervised learning. Explain similarities,
differences, advantages, and limitations.
7. Explain machine learning algorithms in detail. Describe at least five algorithms (e.g.,
Linear Regression, Decision Tree, KNN, K-Means, Naive Bayes).
8. Describe how the modeling process helps in building, validating, and improving machine
learning models. Provide a neat diagram.
UNIT IV
10-MARK QUESTIONS
1. Explain the Hadoop framework in detail. Discuss HDFS, YARN, and MapReduce with a
neat diagram.
2. Discuss Apache Spark in detail. Explain its architecture, features, advantages, and why it is
faster than MapReduce.
3. Compare MapReduce and Spark. Explain how Spark replaces MapReduce along with real-
world examples.
4. Explain NoSQL databases in detail. Discuss characteristics, benefits, and all major types
with examples.
5. Describe ACID properties in detail. Explain how they ensure reliable database transactions
with real-life examples.
6. Explain the CAP theorem in depth. Discuss trade-offs between consistency, availability,
and partition tolerance in distributed systems.
7. Discuss BASE properties in detail. Compare ACID vs BASE and explain their application
in modern databases.
8. Write a detailed note on NoSQL database types: Key-value store, Document store, Wide-
column store, Graph store. Add examples and use cases.
UNIT V
1. What is the first step in a disease prediction case study?
a) Model evaluation b) Setting research goals
c) Data visualization d) Deployment
Answer: b
2. Data retrieval mainly deals with:
a) Cleaning data b) Collecting data from various sources
c) Visualizing results d) Deploying a model
Answer: b
3. Which of the following is part of data preparation?
a) Identifying symptoms b) Handling missing values
c) Deploying the model d) Writing conclusions
Answer: b
4. Exploratory Data Analysis (EDA) helps in:
a) Removing algorithms b) Understanding patterns in data
c) Encrypting data d) Creating user interfaces
Answer: b
5. Disease profiling means:
a) Visualizing patient names b) Identifying disease patterns & features
c) Predicting salaries d) Backing up the database
Answer: b
6. Automation in disease prediction refers to:
a) Making manual reports b) Automating prediction tasks
c) Removing data sources d) Storing paper records
Answer: b
7. Which data is commonly used in disease prediction?
a) Weather data b) Medical & clinical data
c) Vehicle data d) Movie rating data
Answer: b
8. In EDA, correlation helps identify:
a) Missing files b) Relationships between variables
c) User passwords d) Hardware issues
Answer: b
9. A predictive model for diseases generally outputs:
a) A picture b) A prediction (Yes/No or category)
c) A poem d) A video
Answer: b
10. Presenting results in disease prediction often includes:
a) Charts and reports b) Games
c) Audio recordings d) None
Answer: a
FIVE-MARK QUESTIONS
1. Explain the process of setting research goals for a disease prediction project.
2. Describe the methods used for disease-related data retrieval.
3. What are the key steps in data preparation for disease prediction?
4. Write a short note on Exploratory Data Analysis (EDA) in disease prediction.
5. Explain disease profiling with suitable examples.
6. What is the role of data cleaning in effective disease prediction?
7. Describe the importance of visualization in predicting diseases.
8. Mention any five challenges in data retrieval and preparation for medical datasets.
9. Write the need for automation in disease prediction systems.
10. Explain the significance of feature selection in disease prediction.
TEN-MARK QUESTIONS
1. Explain the complete workflow of a disease prediction case study from setting research
goals to final presentation.
2. Discuss in detail the process of data retrieval, preparation, and exploration for disease
prediction with examples.
3. Describe how EDA helps in understanding patterns and symptoms in disease datasets.
Give suitable charts or examples.
4. Explain disease profiling in detail. How does profiling help in diagnosing and predicting
specific diseases?
5. Write a detailed note on building predictive models for diseases. Explain data preparation,
algorithm selection, training, and evaluation.
6. Discuss how automation improves disease prediction systems. Explain tools, workflows,
and real-life examples.
7. Compare manual disease analysis with automated disease prediction systems. Explain
benefits and limitations.
8. Explain the challenges of working with medical data in disease prediction: privacy,
missing data, noise, imbalance, etc.
--------------------End--------------------