Data Science - Mass-With Question Bank-3cs

The document outlines a syllabus for a data science course, covering topics such as the data science process, machine learning algorithms, and the big data ecosystem. It emphasizes the importance of structured approaches in data science projects and discusses various data types and their applications. Additionally, it introduces tools and frameworks essential for managing and analyzing big data, including Hadoop and NoSQL databases.

Uploaded by

athirumal08

Syllabus:

Unit I:
Introduction: Benefits and uses - Facets of data - Data science process - Big data ecosystem
and data science.

Unit II:
The Data science process: Overview - research goals - retrieving data - transformation -
Exploratory Data Analysis - Model building.

Unit III:
Algorithms: Machine learning algorithms - Modeling process - Types - Supervised -
Unsupervised - Semi-supervised.

Unit IV:
Introduction to Hadoop: Hadoop framework - Spark replacing MapReduce - NoSQL -
ACID - CAP - BASE - types.

Unit V:
Case Study: Prediction of Disease - Setting research goals - Data retrieval - preparation -
exploration - Disease profiling - presentation and automation
UNIT I

Introduction: Benefits and uses - Facets of data - Data science process - Big data ecosystem
and data science.

INTRODUCTION:
What is Big data and Data science?
Big data is a blanket term for any collection of data sets so large or complex that it
becomes difficult to process them using traditional data management techniques such as
relational database management systems (RDBMS). The widely adopted
RDBMS has long been regarded as a one-size-fits-all solution, but the demands of handling
big data have shown otherwise. Data science involves using methods to analyze massive
amounts of data and extract the knowledge it contains. You can think of the relationship
between big data and data science as being like the relationship between crude oil and an oil
refinery. Data science and big data evolved from statistics and traditional data management
but are now considered to be distinct disciplines.

Data science is an evolutionary extension of statistics capable of dealing with the
massive amounts of data produced today. It adds methods from computer science to the
repertoire of statistics.

Benefits and uses of data science and big data:

Data science and big data are used almost everywhere in both commercial and
noncommercial settings. The number of use cases is vast.

1. Commercial companies in almost every industry use data science and big data to gain
insights into their customers, processes, staff, competition, and products. Many companies use
data science to offer customers a better user experience, as well as to cross-sell, up-sell, and
personalize their offerings.

A good example of this is Google AdSense, which collects data from internet users so
relevant commercial messages can be matched to the person browsing the internet. MaxPoint
([Link]
2. Governmental organizations not only rely on internal data scientists to discover valuable
information, but also share their data with the public. You can use this data to gain insights
or build data-driven applications; [Link] is but one example.

3. Nongovernmental organizations (NGOs) are also no strangers to using data. They use it
to raise money and defend their causes. The World Wildlife Fund (WWF), for instance,
employs data scientists to increase the effectiveness of their fundraising efforts. Many data
scientists devote part of their time to helping NGOs, because NGOs often lack the resources
to collect data and employ data scientists. DataKind is one such data scientist group that
devotes its time to the benefit of mankind.

4. Universities use data science in their research but also to enhance the study experience of
their students. The rise of massive open online courses (MOOCs) produces a lot of data, which
allows universities to study how this type of learning can complement traditional classes.

Facets of data

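Data comes in many different types, and each of them tends to require different tools and
techniques. The main categories of data are these:

Structured
Unstructured
Natural language
Machine-generated
Graph-based
Audio, image, and video
Streaming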

Structured data:
Structured data is data that depends on a data model and resides in a fixed field within
a record. As such, it is often easy to store structured data in tables within databases or Excel
files (figure 1.1). SQL, or Structured Query Language, is the preferred way to manage and
query data that resides in databases. You may also come across structured data that might
give you a hard time storing it in a traditional relational database; hierarchical data is one
such example. The world is not made up of structured data, though; structure is
imposed upon it by humans and machines. More often, data comes unstructured.
Unstructured data:
Unstructured data is data that is not easy to fit into a data model because the content is
context-specific or varying. One example of unstructured data is your regular email (figure 1.2).
Although email contains structured elements, it is a challenge to find the number of people who
have written an email complaint about a specific employee because so many ways exist to refer
to a person, for example. The thousands of different languages and dialects out there further
complicate this. A human-written email, as shown in figure 1.2, is also a perfect example
of natural language data.
Natural language:

Natural language is a special type of unstructured data; it is challenging to process
because it requires knowledge of specific data science techniques and linguistics.

The natural language processing community has had success in entity recognition,
topic recognition, summarization, text completion, and sentiment analysis, but models trained
in one domain do not generalize well to other domains. Even state-of-the-art techniques cannot
decipher the meaning of every piece of text, because natural language is ambiguous by nature.
The concept of meaning itself is questionable here. Have two people listen to the same
conversation. Will they get the same meaning? The meaning of the same words can vary when
coming from someone upset or joyous.

Machine-generated data:

Machine-generated data is information that is automatically created by a computer,
process, application, or other machine without human intervention. Machine-generated data
is becoming a major data resource and will continue to be one. Wikibon has forecast that the
market value of the industrial Internet (a term coined by Frost & Sullivan to refer to the
integration of complex physical machinery with networked sensors and software) will be
approximately $540 billion in 2020. IDC (International Data Corporation) has estimated there
will be 26 times more connected things than people in 2020. This network is commonly
referred to as the internet of things. The analysis of machine data relies on highly scalable
tools, due to its high volume and speed. Examples of machine data are web server logs, call
detail records, network event logs, and telemetry (figure 1.3).
Graph-based or network data:

"Graph" in this case points to mathematical graph theory. In graph theory, a graph is a
mathematical structure to model pair-wise relationships between objects. Graph or network
data is, in short, data that focuses on the relationship or adjacency of objects. The graph
structures use nodes, edges, and properties to represent and store graphical data. Graph-based
data is a natural way to represent social networks, and its structure allows you to calculate
specific metrics such as the influence of a person and the shortest path between two people.

Examples of graph-based data can be found on many social media websites (figure
1.4). For instance, on LinkedIn you can see who you know at which company. Your follower
list on Twitter is another example of graph-based data. The power and sophistication come
from multiple, overlapping graphs of the same nodes. For example, imagine one graph whose
connecting edges show friendships, and another graph with the same people
which connects business colleagues via LinkedIn. Imagine a third graph based on movie
interests on Netflix. Overlapping the three different-looking graphs makes more interesting
questions possible.
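The two metrics mentioned above, the shortest path between two people and a person's influence, can be sketched on a toy network. This is a minimal illustration only: the people and connections are invented, breadth-first search is one common way to find a shortest path in an unweighted graph, and the number of direct connections is used as a deliberately crude stand-in for influence.

```python
from collections import deque

# Hypothetical toy network: each key is a person (node), each value is the
# set of people they are directly connected to (edges).
network = {
    "Ann": {"Bob", "Cara"},
    "Bob": {"Ann", "Dev"},
    "Cara": {"Ann", "Dev"},
    "Dev": {"Bob", "Cara", "Eli"},
    "Eli": {"Dev"},
}


def shortest_path(graph, start, goal):
    """Breadth-first search: return one shortest list of nodes, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sorting only makes the result deterministic for this example.
        for neighbor in sorted(graph[node]):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


def degree(graph, node):
    """Number of direct connections: a very simple influence measure."""
    return len(graph[node])


print(shortest_path(network, "Ann", "Eli"))  # ['Ann', 'Bob', 'Dev', 'Eli']
print(degree(network, "Dev"))                # 3
```

Real graph databases and libraries offer richer influence measures; the point here is only that the node-and-edge structure makes such questions easy to express.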

Audio, image, and video:

Audio, image, and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in pictures, turn out to
be challenging for computers. MLBAM (Major League Baseball Advanced Media) records video for the
purpose of live, in-game analytics. High-speed cameras at stadiums will capture ball and
athlete movements to calculate in real time, for example, the path taken by a defender relative
to two baselines.

DeepMind created an algorithm capable of learning how to play video games. This algorithm takes
the video screen as input and learns to interpret it through deep learning, a
feat that prompted Google to buy the company for its own artificial intelligence (AI)
development.

Streaming data:

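While streaming data can take almost any of the previous forms, it has an extra
property. The data flows into the system when an event happens instead of being loaded into
a data store in a batch. Although this is not really a different type of data, it is treated as
such because you need to adapt your process to deal with this type of information. Examples
include live sporting or music events and the stock market.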

The data science process

The data science process typically consists of six steps:


1. Setting the research goal:

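Data science is mostly applied in the context of an organization. When the business
asks you to perform a data science project, you first prepare a project charter. This charter
states what you are going to research, how the company benefits from
that, what data and resources you need, a timetable, and deliverables.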

2. Retrieving data:

The second step is to collect data. The project charter states which data you
need and where you can find it. In this step you ensure that you can use the data in your
program, which means checking the existence and quality of the data and your access to it.
Data can also be delivered by third-party companies and takes many forms ranging from Excel
spreadsheets to different types of databases.

3. Data preparation:

Data collection is an error-prone process; in this phase you enhance the quality of the
data and prepare it for use in subsequent steps. This phase consists of three subphases: data
cleansing removes false values from a data source and inconsistencies across data sources,
data integration enriches data sources by combining information from multiple data sources,
and data transformation ensures that the data is in a suitable format for use in your models.
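The three subphases can be sketched in a few lines of plain Python. Everything in this example is an assumption made for illustration: the records, the field names, the rule that non-positive amounts are errors, and the choice of a log scale as the transformation. Only the USQ to USA and United Kingdom to UK corrections echo examples used in these notes.

```python
import math

# Hypothetical raw records from two sources; all names and values are invented.
sales = [
    {"customer_id": 1, "country": "USQ", "amount": "120.50"},
    {"customer_id": 2, "country": "United Kingdom", "amount": "80"},
    {"customer_id": 3, "country": "USA", "amount": "-1"},  # impossible value
]
customers = {1: "Alice", 2: "Bob", 3: "Chen"}

# Known typos and non-standard spellings mapped to one common standard.
COUNTRY_FIXES = {"USQ": "USA", "United Kingdom": "UK"}


def cleanse(rows):
    """Data cleansing: fix known typos and drop rows with impossible values."""
    cleaned = []
    for row in rows:
        amount = float(row["amount"])
        if amount <= 0:  # assumption: non-positive amounts are entry errors
            continue
        country = COUNTRY_FIXES.get(row["country"], row["country"])
        cleaned.append({**row, "country": country, "amount": amount})
    return cleaned


def integrate(rows, names):
    """Data integration: enrich each row with a field from a second source."""
    return [{**row, "name": names.get(row["customer_id"])} for row in rows]


def transform(rows):
    """Data transformation: add a log-scaled amount for use in a model."""
    return [{**row, "log_amount": math.log(row["amount"])} for row in rows]


prepared = transform(integrate(cleanse(sales), customers))
for row in prepared:
    print(row)
```

In practice a library such as pandas is often used for these steps; the plain-Python version is meant only to make the order and purpose of the subphases explicit.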

4. Data exploration:

Data exploration is concerned with building a deeper understanding of your data. You
try to understand how variables interact with each other, the distribution of the data, and
whether there are outliers. To achieve this you mainly use descriptive statistics, visual
techniques, and simple modeling. This step often goes by the abbreviation EDA, for
Exploratory Data Analysis.
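As a small sketch of the descriptive side of this step, the snippet below summarizes one variable, measures how two variables move together, and flags outliers. The data is invented, the Pearson correlation is one of several possible association measures, and the 1.5 x IQR rule is a common convention rather than the only way to define an outlier.

```python
import statistics

# Hypothetical observations: weekly advertising spend and sales (invented).
spend = [10, 12, 15, 17, 20, 22, 25, 27, 30, 95]
sales = [21, 25, 31, 33, 41, 44, 52, 55, 59, 61]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def iqr_outliers(values):
    """Values further than 1.5 interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


print("mean spend:", statistics.mean(spend))
print("median spend:", statistics.median(spend))
print("correlation:", round(pearson(spend, sales), 3))
print("outliers in spend:", iqr_outliers(spend))  # [95]
```

The gap between the mean and the median of spend already hints at the outlier that the last line reports, which is the kind of finding that may send you back to the data preparation step.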

5. Data modeling or model building:

In this phase you use models, domain knowledge, and insights about the data you
found in the previous steps to answer the research question. You select a technique from the
fields of statistics, machine learning, operations research, and so on. Building a model is an
iterative process that involves selecting the variables for the model, executing the model, and
model diagnostics.
6. Presentation and automation:

Finally, you present the results to your business. These results can take many forms,
ranging from presentations to research reports. Sometimes you need to automate the
execution of the process because the business will want to use the insights you gained in
another project or enable an operational process to use the outcome from your model.

AN ITERATIVE PROCESS: The previous description of the data science process
gives you the impression that you walk through this process in a linear way, but in reality you
often have to step back and rework certain findings. For instance, you might find outliers in
the data exploration phase that point to data import errors. As part of the data science process
you gain incremental insights, which may lead to new questions. To prevent rework, make
sure that you scope the business question clearly and thoroughly at the start.

The big data ecosystem and data science

Many big data tools and frameworks exist, and new technologies appear rapidly. The big data
ecosystem can be grouped into technologies that have similar goals and functionalities, which
this section discusses. A separate chapter is dedicated to the most important data science
technology classes.

The mind map in figure 1.6 shows the components of the big data ecosystem. The subsections
below look at the different groups of tools in this diagram and see what each does.

Distributed file systems


Distributed programming framework
Data integration framework
Machine learning frameworks
NoSQL databases
Scheduling tools
Benchmarking tools
System deployment
Service programming
Security
1. Distributed file systems

A distributed file system is similar to a normal file system, except that it runs on
multiple servers at once. Actions such as storing, reading, and deleting files and
adding security to files are at the core of every file system, including the distributed one.
Distributed file systems have significant advantages:

They can store files larger than any one computer disk.
Files get automatically replicated across multiple servers for redundancy or parallel
operations while hiding the complexity of doing so from the user.
The system scales easily: you are no longer bound by the memory or storage
restrictions of a single server.
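The first two advantages can be made concrete with a toy model of block placement. This is not how any particular distributed file system assigns blocks: the round-robin rule, the server names, and the sizes are all assumptions chosen to keep the sketch short. It only shows why splitting a file into blocks lets it exceed one disk, and why replication keeps it readable when a server fails.

```python
def place_blocks(file_size, block_size, servers, replicas):
    """Split a file into blocks and assign each block to `replicas` distinct servers.

    Toy round-robin placement; real systems use more elaborate policies.
    """
    if replicas > len(servers):
        raise ValueError("cannot place more replicas than there are servers")
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block in range(n_blocks):
        placement[block] = [
            servers[(block + r) % len(servers)] for r in range(replicas)
        ]
    return placement


def readable_after_failure(placement, failed_server):
    """True if every block still has at least one copy on a surviving server."""
    return all(
        any(server != failed_server for server in holders)
        for holders in placement.values()
    )


servers = ["s1", "s2", "s3", "s4"]  # hypothetical server names
placement = place_blocks(file_size=1000, block_size=128, servers=servers, replicas=3)

print(len(placement), "blocks")                   # 8 blocks
print(placement[0])                               # ['s1', 's2', 's3']
print(readable_after_failure(placement, "s2"))    # True
```

With a single replica the same file would become unreadable as soon as any server holding one of its blocks failed, which is the redundancy argument in miniature.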


2. Distributed programming framework

Once you have the data stored on the distributed file system, you want to exploit it.
One important aspect of working on a distributed file system is that you do not move your data
to your program; rather, you move your program to the data. When you start from
scratch with a normal general-purpose programming language such as C, Python, or Java,
you need to deal with the complexities that come with distributed programming, such as
restarting jobs that have failed, tracking the results from the different subprocesses, and so
on. Luckily, the open source community has developed many frameworks to handle this for
you, and these give you a much better experience working with distributed data and dealing
with many of the challenges it carries.
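To give a feel for what such a framework takes off your hands, here is a single-machine sketch of the map, shuffle, and reduce pattern applied to word counting, plus a tiny retry helper. It is an illustration under stated assumptions, not the API of Hadoop or any other framework: the function names, the input chunks, and the retry policy are all made up for this example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values that belong to the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


def run_with_retry(task, arg, attempts=3):
    """Re-run a failed task: bookkeeping a framework normally does for you."""
    for attempt in range(attempts):
        try:
            return task(arg)
        except RuntimeError:
            if attempt == attempts - 1:
                raise


# Hypothetical input: in a real cluster each chunk would live on a different server.
chunks = ["big data big insights", "data science uses big data"]

pairs = []
for chunk in chunks:
    pairs.extend(run_with_retry(map_phase, chunk))

counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 3, 'insights': 1, 'science': 1, 'uses': 1}
```

In a real framework the map calls run in parallel next to the data, and the shuffle moves intermediate results across the network; the structure of the computation stays the same.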

3. Data integration framework

Once you have a distributed file system in place, you need to add data. You need to
move data from one source to another, and this is where data integration frameworks
such as Apache Sqoop and Apache Flume excel. The process is similar to an extract,
transform, and load process in a traditional data warehouse.

4. Machine learning frameworks

When you have the data in place, it is time to extract insights from it. This is
where you rely on the fields of machine learning, statistics, and applied mathematics. With
the amount of data we need to analyze today, specialized frameworks and
libraries are required to deal with this amount of data. The most popular machine-learning
library for Python is Scikit-learn. There are, of course, other Python libraries:

PyBrain for neural networks - Neural networks are learning algorithms that
mimic the human brain in learning mechanics and complexity. Neural networks
are often regarded as advanced and black box.
NLTK or Natural Language Toolkit - As the name suggests, its focus is working
with natural language. It comes bundled with a number of text corpora to help you
model your own data.
Pylearn2 - Another machine learning toolbox but a bit less mature than Scikit-learn.
TensorFlow - A Python library for deep learning provided by Google.

5. NoSQL databases

Storing huge amounts of data requires software that is specialized in
managing and querying this data. Traditionally this has been the playing field of relational
databases such as Oracle SQL, MySQL, and Sybase. While they are still the go-to
technology for many use cases, new types of databases have emerged under the grouping of
NoSQL databases.

Many different types of databases have arisen, but they can be categorized into the
following types:

Column databases - Data is stored in columns, which allows algorithms to perform
much faster queries. Newer technologies use cell-wise storage. Table-like
structures are still important.
Document stores - Document stores no longer use tables, but store every
observation in a document. This allows for a much more flexible data scheme.
Streaming data - Data is collected, transformed, and aggregated not in batches
but in real time. Although it is categorized here as a database to help you
in tool selection, it is more a particular type of problem that drove creation of
technologies such as Storm.
Key-value stores - Data is not stored in a table; rather, you assign a key to
every value, such as [Link].2015: 20000. This scales well but
places almost all the implementation on the developer.
SQL on Hadoop - Batch queries on Hadoop are in a SQL-like language that
uses the map-reduce framework in the background.
NewSQL - This class combines the scalability of NoSQL databases with the
advantages of relational databases. They all have a SQL interface and a
relational data model.
Graph databases - Not every problem is best stored in a table. Particular
problems are more naturally translated into graph theory and stored in graph
databases. A classic example of this is a social network.
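The differences between several of these categories are easiest to see when the same small data set is laid out in each shape. The sketch below uses plain Python structures as stand-ins; it does not use or imitate any real database product, and the records, key format, and field names are invented.

```python
import json

# Hypothetical records as they would appear as rows in a relational table.
rows = [
    {"id": 1, "name": "Ann", "city": "Oslo"},
    {"id": 2, "name": "Bob", "city": "Lima"},
]

# Key-value store: one opaque value per key; the application must interpret it.
key_value = {f"user:{row['id']}": json.dumps(row) for row in rows}

# Document store: each observation is a self-contained, possibly nested document,
# and documents do not need to share the same fields.
documents = [
    {"id": 1, "name": "Ann", "address": {"city": "Oslo"}, "tags": ["admin"]},
    {"id": 2, "name": "Bob", "address": {"city": "Lima"}},  # no "tags" field
]

# Column layout: one list per column, so a query on one column skips the others.
columns = {field: [row[field] for row in rows] for field in rows[0]}

# Graph layout: nodes plus explicit relationships between them.
edges = [(1, "follows", 2)]

print(json.loads(key_value["user:2"])["city"])  # Lima
print(columns["city"])                          # ['Oslo', 'Lima']
print([doc.get("tags", []) for doc in documents])
print(edges)
```

The key-value lookup is fast and simple but leaves parsing to the developer, which is the trade-off described in the list above.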
6. Scheduling tools
Scheduling tools help you automate repetitive tasks and trigger jobs based on events
such as adding a new file to a folder. These are similar to tools such as CRON on Linux but
are specifically developed for big data. You can use them, for instance, to start a MapReduce
task whenever a new dataset is available in a directory.
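The idea of triggering a job when a new file appears can be sketched with a simple polling check. This is a bare-bones illustration, not how any particular big data scheduler works: the function names are made up, the "job" is just a function call, and a real tool would add persistence, error handling, and event-based triggers.

```python
import os
import tempfile


def new_files(directory, seen):
    """Return file names that appeared since the last check and remember them."""
    current = set(os.listdir(directory))
    fresh = sorted(current - seen)
    seen.update(current)
    return fresh


def run_pending(directory, seen, job):
    """Run `job` once for every file that is new in `directory`."""
    return [job(os.path.join(directory, name)) for name in new_files(directory, seen)]


# Demonstration in a temporary folder; the job here only reports the file name.
with tempfile.TemporaryDirectory() as inbox:
    seen = set()
    print(run_pending(inbox, seen, os.path.basename))  # [] (nothing to do yet)
    open(os.path.join(inbox, "sales.csv"), "w").close()
    print(run_pending(inbox, seen, os.path.basename))  # ['sales.csv']
    print(run_pending(inbox, seen, os.path.basename))  # [] (already processed)
```

A scheduler would call something like `run_pending` on a timer or in response to a file system event, with the job being, for instance, the start of a MapReduce task.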

7. Benchmarking tools
This class of tools was developed to optimize your big data installation by providing
standardized profiling suites. A profiling suite is taken from a representative set of big data
jobs. Benchmarking and optimizing the big data infrastructure and configuration are not often
jobs for data scientists themselves but for a professional specialized in setting up IT
infrastructure. Using an optimized infrastructure can
make a big cost difference. For example, if you can gain 10% on a cluster of 100 servers, you
save the cost of 10 servers.

8. System deployment
Setting up a big data infrastructure is not an easy task, and assisting engineers in
deploying new applications into the big data cluster is where system deployment tools shine.
They largely automate the installation and configuration of big data components. This is not a
core task of a data scientist.

9. Service programming
Suppose that you have made a world-class soccer prediction application on Hadoop, and
you want to allow others to use the predictions made by your application. However, you have
no idea of the architecture or technology of everyone keen on using your predictions. Service
tools excel here by exposing big data applications to other applications as a service. Data
scientists sometimes need to expose their models through services. The best-known example
is the REST service; REST stands for representational state transfer. It is often used to feed
websites with data.

10. Security
Big data security tools allow you to have central and fine-grained control over access
to the data. Big data security has become a topic in its own right, and data scientists are
usually only confronted with it as data consumers; seldom will they implement the security
themselves, as this is a job for the security expert.


UNIT II

The Data science process: Overview - research goals - retrieving data - transformation -
Exploratory Data Analysis - Model building.

Overview of the data science process

A structured approach to data science helps you to maximize your chances of success in
a data science project at the lowest cost. It also makes it possible to take up a project as a
team, with each team member focusing on what they do best. Take care, however: this
approach may not be suitable for every type of project or be the only way to do good data
science. The typical data science process consists of six steps through which you iterate, as
shown in figure 2.1.

Figure 2.1 summarizes the data science process and shows the main steps and actions
you take during a project.
1. The first step of this process is setting a research goal. The main purpose here is
making sure all the stakeholders understand the what, how, and why of the project. In
every serious project this will result in a project charter.
2. The second phase is data retrieval. You want to have data available for analysis, so
this step includes finding suitable data and getting access to the data from the data
owner. The result is data in its raw form, which probably needs polishing and
transformation before it becomes usable.
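3. The third step is data preparation. You detect and correct different kinds of errors in
the data, combine data from different data sources, and transform it.
4. The fourth step is data exploration. The goal of this step is to gain a deep
understanding of the data. You look for patterns, correlations, and deviations based
on visual and descriptive techniques. The insights you gain from this phase will
enable you to start modeling.
5. The fifth step is model building. It is now
that you attempt to gain the insights or make the predictions stated in your project
charter. Now is the time to bring out the heavy guns, but remember research has
taught us that often (but not always) a combination of simple models tends to
outperform one complicated model. If you have done this phase right, you are almost
done.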
6. The last step of the data science process is presenting your results and automating
the analysis, if needed. One goal of a project is to change a process and/or make
better decisions. You may still need to convince the business that your findings will
indeed change the business process as expected. This is where you can shine in your
influencer role. The importance of this step is more apparent in projects on a strategic
and tactical level. Certain projects require you to perform the business process over
and over again, so automating the project will save time.

Defining research goals and creating a project charter

A project starts by understanding the what, the why, and the how of your project
(figure 2.2). What does the company expect you to do? And why does management place
such a value on your research? Is it part of a bigger strategic picture or a standalone project
originating from an opportunity someone detected? Answering these three questions (what,
why, how) is the goal of the first phase, so that everybody knows what to do and can agree on
the best course of action.

The outcome should be a clear research goal, a good understanding of the context,
well-defined deliverables, and a plan of action with a timetable. This information is then best
placed in a project charter. The length and formality can, of course, differ between projects
and companies. In this early phase of the project, people skills and business acumen are more
important than great technical prowess, which is why this part will often be guided by more
senior personnel.

Spend time understanding the goals and context of your research

An essential outcome is the research goal that states the purpose of your assignment in
a clear and focused manner. Understanding the business goals and context is critical for
project success. Continue asking questions and devising examples until you grasp the exact
business expectations, identify how your project fits in the bigger picture, appreciate how
your research is going to change the business, and understand how they will use your results.
Nothing is more frustrating than spending months researching something until you have that
one moment of brilliance and solve the problem, but when you report your findings back to
the organization, everyone immediately realizes that you misunderstood their question. Do not
skim over this phase lightly. Many data scientists fail here: despite their mathematical wit and
scientific brilliance, they never seem to grasp the business goals and context.
Create a project charter

Clients like to know upfront what they are paying for, so after you have a good
understanding of the business problem, try to get a formal agreement on the deliverables. All
this information is best collected in a project charter. For any significant project this would be
mandatory.

A project charter requires teamwork, and your input covers at least the following:

A clear research goal
The project mission and context
What resources you expect to use
Proof that it is an achievable project, or proof of concepts
Deliverables and a measure of success
A timeline

Retrieving data

Sometimes you need to go into the field and design a data collection process yourself,
but many companies will have already collected and stored the data for you. More and
more organizations are making even high-quality data freely available for public and
commercial use.
Data can be stored in many forms, ranging from simple text files to tables in a
database. The objective now is acquiring all the data you need. This may be difficult, and even if
you succeed, data is often like a diamond in the rough: it needs polishing to be of any use to
you.

Start with data stored within the company

Your first act should be to assess the relevance and quality of the data that is readily
available within your company. Most companies have a program for maintaining key data, so
much of the cleaning work may already be done. This data can be stored in official data
repositories such as databases, data marts, data warehouses, and data lakes maintained by a
team of IT professionals. The primary goal of a database is data storage, while a data
warehouse is designed for reading and analyzing that data. A data mart is a subset of the data
warehouse and geared toward serving a specific business unit. While data warehouses and
data marts are home to preprocessed data, a data lake contains data in its natural or raw
format. But the possibility exists that your data still resides in Excel files on the desktop of a
domain expert.

Finding data even within your own company can sometimes be a challenge. As
companies grow, their data becomes scattered around many places. Knowledge of the data
may be dispersed as people change positions and leave the company. Documentation and
metadata are not always a top priority, so it is possible that you will need to
develop some Sherlock Holmes-like skills to find all the lost bits.

Getting access to data is another difficult task. Organizations understand the value and
sensitivity of data and often have policies in place so everyone has access to what they need
and nothing more. These policies translate into physical and digital barriers called Chinese
walls. These walls are mandatory and well-regulated for customer data in most countries.
This is for good reasons, too; imagine everybody in a credit card company having access to
your spending habits. Getting access to the data may take time and involve company politics.

Many companies specialize in collecting valuable information. For instance, Nielsen and
GFK are well known for this in the retail industry. Other companies provide data so that you,
in turn, can enrich their services and ecosystem. Such is the case with Twitter, LinkedIn, and
Facebook. Although data is considered an asset more valuable than oil by certain companies,
more and more governments and organizations share their data for free with the world. This
data can be of excellent quality; it depends on the institution that creates and manages it. The
information they share covers a broad range of topics such as the number of accidents or
amount of drug abuse in a certain region and its demographics. This data is helpful when you
want to enrich proprietary data but also convenient when training your data science skills at
home. Table 2.1 shows only a small selection from the growing number of open-data
providers.

Do data quality checks now to prevent problems later

Expect to spend a good portion of your project time doing data correction and
cleansing. Most of the errors you encounter during the data retrieval
phase are easy to spot, but being too careless will make you spend many hours solving data
issues that could have been prevented during data import.

You investigate the data during the import, data preparation, and exploratory
phases. The difference is in the goal and the depth of the investigation. During data retrieval,
you check to see if the data is equal to the data in the source document and look to see if you
have the right data types. When you have enough evidence that
the data is similar to the data you find in the source document, you stop. With data
preparation, you do a more elaborate check. If you did a good job during the previous phase,
the errors you find now are also present in the source document. The focus is on the content
of the variables: you want to get rid of typos and other data entry errors and bring the data to
a common standard among the data sets. For example, you might correct USQ to USA and
United Kingdom to UK. During the exploratory phase your focus shifts to what you can learn
from the data. Now you assume the data to be clean and look at the statistical properties such
as distributions, correlations, and outliers. You will often iterate over these phases. For instance,
when you discover outliers in the exploratory phase, they can point to a data entry error. The
following parts look deeper into the data preparation step.

TRANSFORMING DATA

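Certain models require their data to be in a certain shape. After cleansing and integrating
the data, the next task is transforming it so that it takes
a suitable form for data modeling.

Take, for instance, a relationship of the form $y = a e^{bx}$. Taking the log of the dependent
variable gives $\ln y = \ln a + bx$, which is linear in $x$ and simplifies the estimation problem
dramatically. Figure 2.11 shows how transforming a variable greatly simplifies the estimation
problem. Other times you might want to combine two variables into a new variable.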
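A short sketch shows why a log transformation helps for a relationship of the form y = a e^{bx}. Taking the natural log of y gives ln y = ln a + b x, a straight line, so an ordinary least-squares line fit recovers both parameters. The values of a and b, the data points, and the helper name `fit_line` are assumptions made for this example, and the data is noise-free to keep the result easy to check.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical parameters and noise-free data following y = a * exp(b * x).
a_true, b_true = 2.0, 0.5
xs = [0, 1, 2, 3, 4, 5]
ys = [a_true * math.exp(b_true * x) for x in xs]

# Fit a straight line to (x, ln y) and map the coefficients back.
intercept, slope = fit_line(xs, [math.log(y) for y in ys])
a_est, b_est = math.exp(intercept), slope

print(round(a_est, 6), round(b_est, 6))  # 2.0 0.5
```

With noisy data the estimates would only approximate the true values, but the estimation problem stays a simple line fit instead of a nonlinear one.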
REDUCING THE NUMBER OF VARIABLES

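Sometimes you have too many variables and need to reduce the number because they
do not add new information to the model. Having too many variables makes the
model difficult to handle, and certain techniques do not perform well when you overload them
with too many input variables. For instance, all the techniques based on a Euclidean distance
perform well only up to 10 variables.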
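One way to see why distance-based techniques struggle with many variables is to measure how much the nearest and farthest points differ from a query point as the number of variables grows. The sketch below is an illustration of that effect on uniformly random data, not a proof of the 10-variable rule of thumb; the function name, the point counts, and the fixed seed are assumptions made for this example.

```python
import math
import random


def distance_contrast(n_points, n_dims, seed=0):
    """Relative gap between the farthest and nearest neighbor of a random query.

    A large value means 'near' and 'far' are clearly different; a value close
    to zero means all points look roughly equally far away.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(n_dims)]
    distances = [math.dist(query, point) for point in points]
    return (max(distances) - min(distances)) / min(distances)


for dims in (2, 10, 100):
    print(dims, "variables -> contrast", round(distance_contrast(200, dims), 2))
```

The contrast shrinks as variables are added, so a Euclidean distance carries less and less information about which observations are truly similar. This is one motivation for reducing the number of variables before using such techniques.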

Data scientists use special methods to reduce the number of variables but retain the
maximum amount of data. Several of these methods are discussed in chapter 3. Figure 2.12
shows how reducing the number of variables makes it easier to understand the key values. It
also shows how two variables account for 50.6% of the variation within the data set.
These variables are combinations of the original variables; they are the principal
components of the underlying data structure.
Principal components analysis (PCA) will be explained more thoroughly in chapter 3. What
you can also see is the presence of a third (unknown) variable that splits the group of
observations into two.
TURNING VARIABLES INTO DUMMIES

Variables can be turned into dummy variables (figure 2.13). Dummy variables can
only take two values: true (1) or false (0). They are used to indicate the presence or absence
of a categorical effect that may explain the observation. In this case you make separate columns
for the classes stored in one variable and indicate it with 1 if the class is present and 0
otherwise. An example is turning one column named Weekdays into the columns Monday
through Sunday. You use an indicator to show if the observation was on a Monday; you put 1
on Mondays and 0 elsewhere. Turning variables into dummies is a technique that is used in
modeling and is popular with, but not exclusive to, economists.
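The Weekdays example can be written out in a few lines. This is a minimal sketch with invented observations and a hypothetical helper name, `to_dummies`; data analysis libraries provide equivalent functions, but the plain version shows exactly what the transformation does.

```python
WEEKDAYS = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
]


def to_dummies(values, categories):
    """Turn one categorical column into one 0/1 column per category."""
    return [{category: int(value == category) for category in categories}
            for value in values]


# Hypothetical Weekdays column with four observations.
observations = ["Monday", "Wednesday", "Monday", "Sunday"]
dummies = to_dummies(observations, WEEKDAYS)

for day, row in zip(observations, dummies):
    print(day, [row[category] for category in WEEKDAYS])
# Monday    [1, 0, 0, 0, 0, 0, 0]
# Wednesday [0, 0, 1, 0, 0, 0, 0]
# Monday    [1, 0, 0, 0, 0, 0, 0]
# Sunday    [0, 0, 0, 0, 0, 0, 1]
```

Each row contains exactly one 1, in the column of the day on which the observation was made.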

In this section we introduced the third step in the data science process: cleaning,
transforming, and integrating data, which changes your raw data into usable input for the
modeling phase. The next step in the data science process is to get a better understanding of
the content of the data and the relationships between the variables and observations; we
explore this in the next section.

Exploratory data analysis

During exploratory data analysis you take a deep dive into the data (see figure 2.14).
Information becomes much easier to grasp when shown in a picture, so you mainly use
graphical techniques to gain an understanding of your data and the interactions between
variables. This phase is about exploring data, so keeping your mind open and your eyes
peeled is essential during the exploratory data analysis phase. The goal is not to cleanse the
data, but it is common that you still discover anomalies you missed before, forcing you to
take a step back and fix them.

The visualization techniques you use in this phase range from simple line graphs or
histograms, as shown in figure 2.15, to more complex diagrams such as Sankey and network
graphs. Sometimes it is useful to compose a composite graph from simple graphs to get even
more insight into the data. Other times the graphs can be animated or made interactive.

Figure 2.15 from top to bottom, a bar chart, a line plot, and a distribution are some of the graphs used in exploratory analysis.

These plots can be combined to provide even more insight, as shown in figure 2.16.

Figure 2.16 Drawing multiple plots together can help you understand the structure of your data over multiple variables.
Two other important graphs are the histogram shown in figure 2.19 and the boxplot
shown in figure 2.20.

In a histogram a variable is cut into discrete categories and the number of occurrences
in each category is summed up and shown in the graph. The boxplot, on the other hand,
does not show how many observations are present but does offer an impression of the
distribution within categories. It can show the maximum, minimum, median, and other
characterizing measures at the same time.
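The numbers behind both graphs can be computed without any plotting library. The sketch below counts occurrences per equal-width bin, which is what a histogram displays, and computes the five-number summary that a boxplot draws. The ages are invented, the number of bins is an arbitrary choice, and quartile conventions differ between tools, so other software may report slightly different quartiles.

```python
import statistics


def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Assumes the values are not all identical (the bin width must be non-zero).
    """
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for value in values:
        # The maximum value would land in bin n_bins, so clamp it to the last bin.
        index = min(int((value - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical variable: ages of ten respondents.
ages = [21, 22, 22, 25, 27, 30, 31, 35, 41, 60]

print(histogram(ages, 4))         # [6, 2, 1, 1]
print(five_number_summary(ages))
```

The histogram shows where observations pile up, while the summary shows how far the maximum sits from the bulk of the data, which is how a boxplot makes a possible outlier such as 60 visible.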
Build the models

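With clean data in place and a good understanding of the content, you are ready to
build models with the goal of making better predictions, classifying objects, or gaining an
understanding of the system that you are modeling. This phase is much more focused than the
exploratory analysis step, because you know what you are looking for and what you want the
outcome to be. Figure 2.21 shows the components of model building.

The techniques you use now are borrowed from the fields of machine learning, data
mining, and/or statistics. In this chapter we only explore the tip of the iceberg of existing
techniques, while chapter 3 introduces them properly. In practice, 20% of the
techniques will help you in 80% of the cases because techniques overlap in what they try to
accomplish. They often achieve their goals in similar but slightly different ways.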

Building a model is an iterative process. The way you build your model depends on
whether you go with classic statistics or the somewhat more recent machine learning school,
and the type of technique you want to use. Either way, most models consist of the following
main steps:

1. Selection of a modeling technique and variables to enter in the model
2. Execution of the model
3. Diagnosis and model comparison

Model and variable selection

You need to select the variables you want to include in your model and a modeling
technique. Your findings from the exploratory analysis should already give a fair idea of what
variables will help you construct a good model. Many modeling techniques are available, and
choosing the right one for a problem requires judgment on your part. You need to
consider model performance and whether your project meets all the requirements to use your
model, as well as other factors:

Must the model be moved to a production environment and, if so, would it be easy to
implement?
How difficult is the maintenance on the model: how long will it remain relevant if left
untouched?
Does the model need to be easy to explain?

Model execution

Once you have chosen a model you will need to implement it in code.
Most programming languages, such as Python, already have libraries such as StatsModels or
Scikit-learn. These packages implement several of the most popular techniques. Coding a model is a
nontrivial task in most cases, so having these libraries available can speed up the process. It is
fairly easy to use linear regression with
StatsModels or Scikit-learn. Doing this yourself would require much more effort even for the
simple techniques. The following listing shows the execution of a linear prediction model.
We, however, created the target variable based on the predictor by adding a bit of
randomness, so it should come as no surprise that this gives us a well-fitting model. The
summary() call on the fitted results outputs the table in figure 2.23. Mind you, the exact outcome depends on
the random variables you got.
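
As a rough stand-in for the missing listing, the sketch below fits a one-predictor linear model in plain Python rather than with StatsModels or Scikit-learn. The data, the seed, the true slope and intercept, and all function names are assumptions chosen for illustration; as in the text, the target is built from the predictor plus a bit of noise, so a good fit is expected.

```python
import random

random.seed(0)  # arbitrary fixed seed so the run is repeatable

# Hypothetical data: target = 2 * predictor + 1, plus a little random noise.
x = [random.uniform(0, 10) for _ in range(200)]
y = [2.0 * xi + 1.0 + random.gauss(0, 1) for xi in x]

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(xs, ys, intercept, slope):
    """Share of the variation in ys that the fitted line captures."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(xs, ys))
    ss_tot = sum((b - mean_y) ** 2 for b in ys)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_observations, n_predictors):
    """R-squared with a penalty that grows with the number of predictors."""
    return 1 - (1 - r2) * (n_observations - 1) / (n_observations - n_predictors - 1)

intercept, slope = fit_line(x, y)
r2 = r_squared(x, y, intercept, slope)
adj_r2 = adjusted_r_squared(r2, len(x), 1)
print(f"intercept={intercept:.3f} slope={slope:.3f} R2={r2:.3f} adjusted R2={adj_r2:.3f}")
```

With a single predictor and 200 observations the penalty is tiny, which matches the remark that the two R-squared values barely differ for a simple model.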

Model fit - For this the R-squared or adjusted R-squared is used. This measure is an
indication of the amount of variation in the data that gets captured by the model. The
difference between the adjusted R-squared and the R-squared is minimal here because the
adjusted one is the normal one plus a penalty for model complexity. A model gets complex
when many variables (or features) are introduced. You do not need a complex model if a
simple model is available, so the adjusted R-squared punishes you for overcomplicating. At
any rate, 0.893 is high, and it should be because we cheated. Rules of thumb exist, but for
models in businesses, models above 0.85 are often considered good. If you want to win a
competition you need a fit in the high 0.90s. For research, however, very low model fits (<0.2
even) are often found. What is more important there is the influence of the introduced predictor
variables.

Predictor variables have a coefficient - For a linear model this is easy to interpret: the
coefficient is the change in the target when the predictor increases by one unit.
Finding a good predictor can be your route to a Nobel Prize even though your model as a
whole is rubbish. If, for instance, you determine that a certain gene is significant as a cause
for cancer, this is important knowledge, even if that gene in itself doesn't determine whether
a person will get cancer. The example here is classification, not regression, but the point
remains the same: detecting influences is more important in scientific studies than perfectly
fitting models (not to mention more realistic). But when do we know a gene has that impact?
This is called significance.

Predictor significance - Coefficients are great, but sometimes not enough evidence
exists to show that the influence is there. This is what the p-value is about. A long
explanation about type 1 and type 2 mistakes is possible here, but the short explanation
would be: if the p-value is lower than 0.05, the variable is considered significant by most
people. The threshold is a convention, and whether to accept it is up to you.
Several people introduced the extremely significant (p<0.01) and marginally significant
(p<0.1) thresholds.

Linear regression works if you want to predict a value, but what if you want to
classify something? Then you go to classification models, the best known among them being
k-nearest neighbors.

As shown in figure 2.24, k-nearest neighbors looks at labeled points near an
unlabeled point and, based on this, makes a prediction of what the label should be.
As before, we construct random correlated data and, surprise, surprise, we get 85% of cases
correctly classified. To look in more depth we need to score the model, which here means
applying it on data to make a prediction.

The prediction is obtained by calling predict(predictors) on the fitted classifier and storing
the result as prediction.
Now we can use the prediction and compare it to the real thing using a confusion matrix:
metrics.confusion_matrix(target, prediction)
We get a 3-by-3 matrix as shown in figure 2.25.
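
The idea behind both steps can be sketched without Scikit-learn. The points, labels, and function names below are hypothetical, and the example uses two classes rather than the three of figure 2.25, so it produces a 2-by-2 matrix.

```python
from collections import Counter

# Hypothetical labeled points: (feature_1, feature_2) -> label
training = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.3), "a"),
            ((4.0, 4.2), "b"), ((4.1, 3.9), "b"), ((3.8, 4.0), "b")]

def knn_predict(point, labeled, k=3):
    """Label a point by majority vote among its k nearest labeled points."""
    by_distance = sorted(
        labeled,
        key=lambda item: sum((p - q) ** 2 for p, q in zip(point, item[0])),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

def confusion_matrix(actual, predicted, labels):
    """Entry [i][j] counts observations with true label labels[i] predicted as labels[j]."""
    index = {label: n for n, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

new_points = [(1.1, 1.1), (3.9, 4.1), (2.0, 2.0)]
true_labels = ["a", "b", "b"]  # assumed ground truth for the new points
predictions = [knn_predict(p, training) for p in new_points]
matrix = confusion_matrix(true_labels, predictions, ["a", "b"])
print(predictions)
print(matrix)
```

Correct predictions land on the main diagonal; the third point sits closer to the "a" group, so it ends up off the diagonal.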

But is it really a surprise? No, for the following reasons:

For one, the classifier had but three options, so a fair share of correct guesses is expected
even for a real random distribution like flipping a coin.

Second, we correlated the response variable with the predictors. Because of the way we
constructed the data, a naive guess would
already have a similar result.

Third, we compared the prediction with the real values, true, but we never predicted based
on fresh data. The prediction was done using the same data as the data used to build
the model. This is all fine and dandy to make yourself feel good, but it gives you no
indication of whether your model will work when it encounters truly new data. For
this we need a holdout sample, as will be discussed in the next section.
Model diagnostics and model comparison

You will build multiple models and then choose the best one based on
multiple criteria. Working with a holdout sample helps you pick the best-performing model.
A holdout sample is a part of the data you leave out of the model building so it can be used to
evaluate the model afterward. The principle here is simple: the model should work on unseen
data. You use only a fraction of your data to estimate the model and the other part, the
holdout sample, is kept out of the equation. The model is then unleashed on the unseen data
and error measures are calculated to evaluate it. Multiple error measures are available, and in
figure 2.26 we show the general idea on comparing models. The error measure used in the
example is the mean square error.

Figure 2.26 Formula for mean square error


Mean square error is a simple measure: check for every prediction how far it was from the
truth, square this error, and average the squared errors over all predictions.

Figure 2.27 compares the performance of two models to predict the order size from the
price. The first model is size = 3 * price and the second model is size = 10. To estimate the
models, we use 800 randomly chosen observations out of 1,000 (or 80%), without showing
the other 20% of data to the model. Once the model is trained, we predict the values for the
other 20% of the observations, for which we already know the true value, and calculate the
model error with an error measure. Then we choose the model with the lowest error. In this
example we chose model 1 because it has the lowest total error.
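
The comparison can be sketched as follows. The generated data is hypothetical (order size is made to be roughly 3 * price), and the seed and names are assumptions. Because both models from the text are fixed formulas, nothing is estimated here and the training part goes unused; only the holdout evaluation is shown.

```python
import random

random.seed(1)  # arbitrary fixed seed so the run is repeatable

# Hypothetical data: 1,000 observations of (price, order size).
prices = [random.uniform(1, 10) for _ in range(1000)]
data = [(p, 3 * p + random.gauss(0, 1)) for p in prices]

random.shuffle(data)
train, holdout = data[:800], data[800:]  # 80% to estimate, 20% kept unseen

def mean_squared_error(model, observations):
    """Average squared distance between prediction and truth."""
    return sum((model(price) - size) ** 2 for price, size in observations) / len(observations)

def model_1(price):
    return 3 * price

def model_2(price):
    return 10

errors = {
    "model 1": mean_squared_error(model_1, holdout),
    "model 2": mean_squared_error(model_2, holdout),
}
best = min(errors, key=errors.get)
print(errors, "->", best)
```

The model with the lowest error on the holdout sample is the one to keep.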

Many models make strong assumptions, such as independence of the inputs, and you
have to verify that these assumptions are indeed met. This is called model diagnostics.
This section gave a short introduction to the steps required to build a valid model.
UNIT III
Algorithms: Machine learning algorithms Modeling process Types Supervised
Unsupervised - Semi-supervised.

INTRODUCTION TO MACHINE LEARNING

Machine Learning (ML) is a subset of artificial intelligence that enables systems to
learn patterns from data and make decisions or predictions without being explicitly
programmed.

Key Characteristics

Learns from data.

Improves performance over time.

Identifies patterns automatically.

Used in classification, prediction, clustering, recommendation, etc.

THE MODELING PROCESS

The modeling phase consists of four steps:

1. Feature engineering and model selection
2. Training the model
3. Model validation and selection
4. Applying the trained model to unseen data

Feature engineering and model selection:

With feature engineering, you must come up with and create possible predictors for
the model. This is one of the most important steps in the process because a model recombines
these features to achieve its predictions. Often you may need to consult an expert or the
appropriate literature to come up with meaningful features.

Certain features are the variables you get from a data set, as is the case with the
data sets provided in exercises. In practice you will need to
find the features yourself, which may be scattered among different data sets. In several
projects we had to bring together more than 20 different data sources before we had the raw
data we required.

In medicine, clinical pharmacy is a discipline dedicated to researching the effect of
the interaction of medicines. It does not even have to involve two
medicines to produce potentially dangerous results. For example, mixing an antifungal
medicine such as Sporanox with grapefruit has serious side effects.

Training your model

With the right predictors in place and a modeling technique in mind, you can progress
to model training. In this phase you present to your model data from which it can learn.

The most common modeling techniques have industry-ready implementations in
almost every programming language, including Python. These enable you to train your
models by executing a few lines of code. For more state-of-the-art data science techniques,
you will probably end up doing heavy mathematical calculations and implementing them with
modern computer science techniques.

Once a model is trained, it is time to test whether it can be extrapolated to reality:
model validation.

Validating a model

Data science has many modeling techniques, and the question is which one is the right
one to use. A good model has two properties: it has good predictive power and it generalizes
well to data it has not seen. To achieve this you define an error measure (how wrong the model
is) and a validation strategy.

Two common error measures in machine learning are the classification error rate for
classification problems and the mean squared error for regression problems. The
classification error rate is the percentage of observations in the test data set that your model
mislabeled; lower is better. The mean squared error measures how big the average error of
your prediction is. Squaring the error has two consequences. First, you cannot cancel out a
wrong prediction in one direction with a faulty prediction in the other direction. For example,
overestimating future turnover for next month by 5,000 does not cancel out underestimating it
by 5,000 for the following month. As a second consequence of squaring, bigger errors get
even more weight than they otherwise would. Small errors remain small or can even shrink
(if <1), whereas big errors are enlarged and will definitely draw your attention.
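
Both error measures fit in a few lines. The sketch below uses made-up turnover figures and labels to show that the plain average error cancels out while the mean squared error does not; the function names are chosen for the example.

```python
def classification_error_rate(actual, predicted):
    """Fraction of observations the model mislabeled; lower is better."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

def mean_squared_error(actual, predicted):
    """Average of the squared differences between truth and prediction."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical turnover: overestimated by 5,000 one month, underestimated by 5,000 the next.
actual_turnover = [100_000, 100_000]
predicted_turnover = [105_000, 95_000]

mean_error = sum(p - a for a, p in zip(actual_turnover, predicted_turnover)) / 2
mse = mean_squared_error(actual_turnover, predicted_turnover)

# Hypothetical classification result: one of four emails mislabeled.
error_rate = classification_error_rate(
    ["spam", "safe", "spam", "safe"],
    ["spam", "spam", "spam", "safe"],
)
print(mean_error, mse, error_rate)
```

The unsquared errors average to zero, which would wrongly suggest a perfect model; squaring keeps both mistakes visible.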

Many validation strategies exist, including the following common ones:

Dividing your data into a training set with X% of the observations and keeping the
rest as a holdout data set (a data set that is never used for model creation)-This is the
most common technique.
K-folds cross validation-This strategy divides the data set into k parts and uses each
part one time as a test data set while using the others as a training data set. This has
the advantage that you use all the data available in the data set.
Leave-1 out-This approach is the same as k-folds but with k equal to the number of
observations. You always leave one
observation out and train on the rest of the data. This is used only on small data sets,
so it is more valuable to people evaluating laboratory experiments than to big data
analysts.
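
A minimal sketch of how k-folds splitting works, in plain Python; the function name and the toy data are assumptions. Leave-1 out falls out of the same function by setting k to the number of observations.

```python
def k_fold_splits(observations, k):
    """Yield (training, test) pairs; every observation is in a test set exactly once."""
    folds = [observations[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        training = [obs for j, fold in enumerate(folds) if j != i for obs in fold]
        yield training, test

data = list(range(10))  # ten hypothetical observations

splits = list(k_fold_splits(data, k=5))
leave_one_out = list(k_fold_splits(data, k=len(data)))

for training, test in splits:
    print("train on", training, "test on", test)
```

In practice the observations are usually shuffled before splitting; that step is left out here to keep the folds easy to read.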

Predicting new observations

If the first three steps went well, you now have a performant
model that generalizes to unseen data. The process of applying your model to new data is
called model scoring. In fact, model scoring is something you implicitly did during
validation, only now you do not know the correct outcome. By now you should trust your
model enough to use it for real.

Model scoring involves two steps. First, you prepare a data set that has features
exactly as defined by your model. This boils down to repeating the data preparation you did
in step one of the modeling process but for a new data set. Then you apply the model on this
new data set, and this results in a prediction.

The next section looks at the different types of machine learning techniques: a different
problem requires a different approach.

Types of machine learning

We can divide the different approaches to machine learning by the amount of human
effort required to coordinate them and how they use labeled data (data with a category
or a real-value number assigned to it that represents the outcome of previous observations).
Supervised learning techniques attempt to discern results and learn by trying to find
patterns in a labeled data set. Human interaction is required to label the data.
Unsupervised learning techniques do not rely on labeled data and attempt to find
patterns in a data set without human interaction.
Semi-supervised learning techniques need labeled data, and therefore human
interaction, to find patterns in the data set, but they can still progress toward a result
and learn even if passed unlabeled data as well.

Supervised learning

Supervised learning is a learning technique that can only be applied on labeled data.
An example is discerning digits from images, which the following section covers in a
case study on number recognition.

Case Study: Discerning Digits From Images

One of the many common approaches on the web to stopping computers from hacking
into user accounts is the Captcha check: a picture of text and numbers that the human user
must decipher and enter into a form field before sending the form back to the web server.
Something like figure 3.3 should look familiar.

With the help of the Naïve Bayes classifier, a simple yet powerful algorithm to categorize
observations into classes that is explained in more detail in the sidebar, you can recognize
digits from images that are not unlike the Captcha checks used by
many websites.
Step 1. Our research goal is to let a computer recognize images of numbers.

Introducing Naïve Bayes classifiers in the context of a spam filter

Not every email you receive has honest intentions. Your inbox can contain unsolicited
commercial or bulk emails, also known as spam. Spam is annoying, and it is often used in scams
and as a carrier for viruses. Kaspersky estimates that more than 60% of the emails in the
world are spam. To protect users from spam, most email clients run a program in the
background that classifies emails as either spam or safe.

A popular technique in spam filtering is employing a classifier that uses the words inside the
mail as predictors. It outputs the chance of a specific email being spam given the words it is
composed of (in mathematical terms, P(spam | words) ). To reach this conclusion it uses three
calculations:

P(spam)-The average rate of spam without knowledge of the words. According to
Kaspersky, an email is spam 60% of the time.
P(words)-How often this word combination is used regardless of spam.
P(words | spam)-How often these words are seen when a training mail was labeled as
spam.

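To determine the chance that a new email is spam, you use the following formula:

P(spam|words) = P(spam)P(words|spam) / P(words)

This is an application of the rule P(B|A) = P(B)P(A|B) / P(A), which is known as Bayes's
rule and which lends its name to this classifier.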
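
The formula can be turned into a small spam filter. The sketch below is an illustration in plain Python, not the classifier from any library: the four training mails are invented, the "naïve" assumption that words are independent given the label is made explicit, and add-one (Laplace) smoothing is an extra assumption added so that unseen words do not force a probability to zero.

```python
from collections import Counter

# Hypothetical, tiny training set of (text, label) pairs.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda today", "safe"),
    ("lunch today", "safe"),
]

def train(examples):
    """Count labels and per-label word occurrences."""
    label_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in label_counts}
    for text, label in examples:
        word_counts[label].update(text.split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return label_counts, word_counts, vocabulary

def p_spam_given_words(text, label_counts, word_counts, vocabulary):
    """P(spam | words) via Bayes's rule, treating words as independent given the label."""
    total = sum(label_counts.values())
    joint = {}
    for label in label_counts:
        prob = label_counts[label] / total              # P(label)
        n_words = sum(word_counts[label].values())
        for word in text.split():
            # Add-one smoothing: an assumption of this sketch, not part of the formula.
            prob *= (word_counts[label][word] + 1) / (n_words + len(vocabulary))
        joint[label] = prob                              # P(label) * P(words | label)
    return joint["spam"] / sum(joint.values())           # dividing by P(words)

model = train(training)
p_offer = p_spam_given_words("cheap money", *model)
p_meeting = p_spam_given_words("meeting today", *model)
print(round(p_offer, 3), round(p_meeting, 3))
```

P(words) is obtained here as the sum of the joint probabilities over both labels, which is the same denominator as in the formula.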

Step 2 of the data science process: fetching the digital image data

gray image, you put a value in every matrix entry that depicts the gray value to be shown.
The following code demonstrates this process and is step four of the data science process:
data exploration.
actual code output, but perhaps figure 3.5 can clarify this slightly, because it shows how each

more work to do. The Naïve Bayes classifier is expecting a list of values, but each image is
stored as a two-dimensional array (a matrix) reflecting the shape of the image. To flatten it into
a list, we need to call reshape() on the images array. The net result will be a one-dimensional
array per image that looks something like this:

array([[ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10., 15., 5., 0., 0., 3., 15., 2., 0., 11., 8., 0.,
0., 4., 12., 0., 0., 8., 8., 0., 0., 5., 8., 0., 0., 9., 8., 0., 0., 4., 11., 0., 1., 12., 7., 0., 0., 2., 14., 5.,
10., 12., 0., 0., 0., 0., 6., 13., 10., 0., 0., 0.]])
The end result of this code is called a confusion matrix, such as the one shown in figure 3.6.
Returned as a two-dimensional array, it shows how often the number predicted was the
correct number on the main diagonal and also in the matrix entry (i,j), where j was predicted
but the image showed i. Looking at figure 3.6 we can see that the model predicted the number
2 correctly 17 times (at coordinates 3,3), but also that the model predicted the number 8 15
times when it was actually the number 2 in the image (at 9,3).
Unsupervised learning

Most large data sets do not have labels on their data, so unless you sort through it all and
label it, the supervised learning approach will not work.
Instead, we must take the approach that will work with this data because

We can study the distribution of the data and infer truths about the data in different
parts of the distribution.
We can study the structure and values in the data and infer new, more meaningful data
and structure from it.

Many techniques exist for each of these unsupervised learning approaches. However, in the
real world you are always working toward the research goal defined in the first phase of the
data science process, so you may need to combine or try different techniques before either a
data set can be labeled, enabling supervised learning techniques, perhaps, or even the goal
itself is achieved.

Discerning a Simplified Latent Structure from Your Data

Not everything can be measured. When you meet someone for the first time you
might try to guess whether they like you based on their behavior and how they respond. But
what if they have had a bad day until now? Perhaps they are still
down from attending a funeral the week before? The point is that certain variables can be
immediately available while others can only be inferred and are therefore missing from your
data set. The first type of variable is known as an observable variable and the second type
is known as a latent variable. In our example, the emotional state of your new friend is a
latent variable. It definitely influences their judgment of you.

Deriving or inferring latent variables and their values based on the actual contents of a
data set is a valuable skill to have because

Latent variables can substitute for several existing variables already in the data set.
By reducing the number of variables in the data set, the data set becomes more
manageable, any further algorithms run on it work faster and predictions may become
more accurate.
Because latent variables are designed or targeted toward the defined research goal,
you lose little key information by using them.

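Case Study: finding latent variables in a wine quality data set

In this short case study, you use a technique known as Principal
Component Analysis (PCA) to find latent variables in a data set that describes the quality of
wine. You then
compare how well a set of latent variables works in predicting the quality of wine against the
original observable set. You will learn

1. How to identify and derive those latent variables.
2. How to analyze where the sweet spot is (how many new variables return the most
utility) by generating and interpreting a scree plot generated by PCA. (We will look at
scree plots in a moment.)

The following are the main components of this example.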

Data set-The University of California, Irvine (UCI) has an online repository of 325
data sets for machine learning exercises. We use the
Wine Quality Data Set for red wines created by P. Cortez, A. Cerdeira, F. Almeida, T.
Matos, and J. Reis. It is 1,600 lines long and has 11 variables per line, as shown in
table 3.2.
Principal Component Analysis-A technique to find the latent variables in your data
set while retaining as much information as possible.
Scikit-learn-We use this library because it already implements PCA for us and offers a
way to generate the scree plot.

With the initial data preparation behind you, you can execute the PCA. The resulting scree
plot (which will be explained shortly) is shown in figure 3.8. Because PCA is an explorative
technique, we now arrive at step four of the data science process: data exploration, as shown
in the following listing.
The plot generated from the wine data set is shown in figure 3.8. What you hope to
see is an elbow or hockey stick shape in the plot. This indicates that a few variables can
represent the majority of the information in the data set while the rest only add a little more.
In our plot, PCA tells us that reducing the set down to one variable can capture approximately
28% of the total information in the set (the plot is zero-based, so variable
one is at position zero on the x axis), two variables will capture approximately 17% more or
45% total, and so on. Table 3.3 shows you the full read-out.
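
The percentages read from a scree plot are the shares of total variance captured by each component. The sketch below computes them for a deliberately small case, two correlated variables, where the eigenvalues of the covariance matrix have a closed form. The data and function names are hypothetical, and this is not the Scikit-learn code used for the wine data set.

```python
import math

# Hypothetical data with two strongly correlated variables.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]

def covariance(a, b):
    """Sample covariance of two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)

def explained_variance_ratio(a, b):
    """Eigenvalues of the 2x2 covariance matrix, largest first, as shares of the total."""
    var_a, var_b, cov_ab = covariance(a, a), covariance(b, b), covariance(a, b)
    trace = var_a + var_b
    root = math.sqrt((var_a - var_b) ** 2 + 4 * cov_ab ** 2)
    eigenvalues = [(trace + root) / 2, (trace - root) / 2]
    return [value / trace for value in eigenvalues]

ratios = explained_variance_ratio(x, y)
cumulative = [sum(ratios[: i + 1]) for i in range(len(ratios))]
print("share per component:", ratios)
print("cumulative share:   ", cumulative)
```

Plotting the per-component shares against the component number gives the scree plot; the cumulative shares correspond to the running totals quoted in the text.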
An elbow shape in the plot suggests that five variables can hold most of the
information found inside the data. You could argue for a cut-off at six or seven variables
instead, but we
opt for a simpler data set versus one with less variance in data
against the original data set.

At this point, we could go ahead and see if the original data set recoded with five
latent variables is good enough to predict the quality of the wine accurately, but before we do,
we look at how to identify what the latent variables represent.
Interpreting the new variables:

With the initial decision made to reduce the data set from 11 original variables to 5
latent variables, we can check whether it is possible to interpret or name them based on
their relationships with the originals. Actual names are easier to work with than codes such as
lv1, lv2, and so on. We can add the line of code in the following listing to generate a table
that shows how the two sets of variables correlate.

The rows in the resulting table (table 3.4) show the mathematical correlation. Or, in English,
the first latent variable lv1, which captures approximately 28% of the total information in the
set, has the following formula.
Giving a usable name to each new variable is a bit trickier and would probably require
consultation with a wine expert.

We can now recode the original data set with only the five latent variables. Doing this is data
preparation again, so we revisit step three of the data science process: data preparation. As
mentioned in chapter 2, the data science process is a recursive one and this is especially true
between step three: data preparation and step four: data exploration.

Table 3.6 shows the first three rows with this done.

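Comparing the accuracy of the original data set with latent variables:

Now that the data set is recoded into 5 latent variables rather
than the 11 originals, we can see how well it predicts wine quality compared to the
original. We use the Naïve Bayes classifier
algorithm we saw in the previous example for supervised learning to help.

We start by seeing how well the original 11 variables could predict the wine quality
scores. The following listing presents the code to do this.
We then run the same prediction test starting with only 1 latent variable instead of the
original 11, and add latent variables one at a time to see how the
predictive performance improves. The following listing shows how this is done.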
Grouping similar observations to gain insight from the distribution of your data:

In clustering, or cluster analysis,
we attempt to divide our data set into observation subsets, or clusters, wherein observations
should be similar to those in the same cluster but differ greatly from the observations in other
clusters. Figure 3.10 gives you a visual idea of what clustering aims to achieve. The circles in
the top left of the figure are clearly close to each other while being farther away from the
others. The same is true of the crosses in the top right.

Scikit-learn implements several common algorithms for clustering data in its
sklearn.cluster module, including the k-means algorithm, affinity propagation, and spectral
clustering. Each has use cases for which it is more suited, although
k-means is a good general-purpose algorithm with which to get started. However, like
many clustering algorithms, k-means needs the number of desired clusters specified in advance,
which necessarily results in a process of trial and error before reaching a decent conclusion. It
also presupposes that all the data required for analysis is available already.

The final case study on unsupervised learning uses the Iris data set, which describes flowers
by sepal
length and width, petal length and width, and so on. k-means
starts from randomly chosen
values, so you can end up with a different cluster every time you run the algorithm unless you
manually define the start values by specifying a seed (constant for the start value generator).
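
The role of the seed is easiest to see in a bare-bones k-means. The sketch below is a simplified plain-Python version with hypothetical 2-D points, not the Scikit-learn implementation; the function name, the iteration count, and the choice to start from randomly sampled data points are assumptions.

```python
import random

def k_means(points, k, seed, iterations=20):
    """Plain k-means on 2-D points; the seed fixes the random start values."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: (p[0] - centers[c][0]) ** 2 + (p[1] - centers[c][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical data: two visibly separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]

centers_a, clusters_a = k_means(points, k=2, seed=42)
centers_b, clusters_b = k_means(points, k=2, seed=42)
print(centers_a)
print(clusters_a)
```

Running it twice with the same seed gives identical centers; changing the seed may change the order of the clusters or, on less clean data, the clusters themselves.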

hierarchical clustering techniques.


Figure 3.11 shows the output of the iris classification.

This figure shows that even without using a label you would find clusters that are similar
to the official iris
classification with a result of 134 (50+48+36) correct
classifications out of 150.

You do not always need to choose between
supervised and unsupervised; sometimes combining
them is an option.
Semi-supervised learning

Although we would like all our data to be labeled so we
can use the more powerful supervised machine learning techniques, in reality we often start
with only minimally labeled data, if it is labeled at all. We can use unsupervised machine
learning techniques to analyze what we have and perhaps add labels to the data set, but it will
be prohibitively costly to label it all. Our goal then is to train our predictor models with as
little labeled data as possible. This is where semi-supervised learning techniques come in.

Take for example the plot in figure 3.12. In this case, the data has only two labeled
observations; normally this is too few to make valid predictions.

A common semi-supervised learning technique is label propagation. In this technique,
you start with a labeled data set and give the same label to similar data points. This is similar
to running a clustering algorithm over the data set and labeling each cluster based on the
labels they contain. If we were to apply this approach to the data set in figure 3.12, we might
end up with something like figure 3.13.
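
A very simplified flavor of label propagation can be sketched in plain Python: starting from two labeled observations, each step gives the unlabeled point closest to an already labeled point that point's label. This nearest-neighbor variant is an assumption made for illustration and is not the algorithm implemented in any particular library; the points and the labels "buy" and "ignore" are invented.

```python
def propagate_labels(points, initial_labels):
    """Repeatedly label the unlabeled point that lies closest to any labeled point."""
    labels = dict(initial_labels)  # index -> label; copied so the input stays untouched

    def squared_distance(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))

    while len(labels) < len(points):
        unlabeled = [i for i in range(len(points)) if i not in labels]
        closest_unlabeled, closest_labeled = min(
            ((u, l) for u in unlabeled for l in labels),
            key=lambda pair: squared_distance(*pair),
        )
        labels[closest_unlabeled] = labels[closest_labeled]
    return labels

# Hypothetical data: six observations, only two of which carry a label.
points = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.1), (5.0, 5.0), (5.5, 5.1), (6.0, 4.8)]
result = propagate_labels(points, {0: "buy", 3: "ignore"})
print(result)
```

Because newly labeled points can pass their label on, a label spreads along a chain of nearby points, much like labeling whole clusters from the few labels they contain.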

One special approach to semi-supervised learning worth mentioning here is active
learning. In active learning the program points out the observations it wants to see labeled for
its next round of learning based on some criteria you have specified. For example, you might
set it to try and label the observations the algorithm is least certain about, or you might use
multiple models to make a prediction and select the points where the models disagree the
most.
With the basics of machine learning at your disposal, the next chapter discusses using
machine learning within the constraints of a single computer. This tends to be challenging
when the data set is too big to load entirely into memory.
UNIT IV
Introduction to Hadoop: Hadoop framework Spark replacing MapReduce NoSQL
ACID CAP BASE types.

Hadoop: a framework for storing and processing large data sets

Apache Hadoop is a framework that simplifies working with a cluster of computers. It
aims to be all of the following things and more:

Reliable-By automatically creating multiple copies of the data and redeploying
processing logic in case of failure.
Fault tolerant-It detects faults and applies automatic recovery.
Scalable-Data and its processing are distributed over clusters of computers
(horizontal scaling).
Portable-Installable on all kinds of hardware and operating systems.

The core framework is composed of a distributed file system, a resource manager, and
a system to run distributed programs. In practice it allows you to work with the distributed
file system almost as easily as with the local file system of your home computer. But in the
background, the data can be scattered among thousands of servers.

The Different Components of Hadoop:

A distributed file system (HDFS)
A method to execute programs on a massive scale (MapReduce)
A system to manage the cluster resources (YARN)

On top of that, an ecosystem of applications arose (figure 5.2), such as the databases Hive
and HBase. Hive has a language based on the widely used SQL to interact with data stored inside
the database.
MapReduce: How Hadoop Achieves Parallelism

Hadoop uses a programming method called MapReduce to achieve parallelism. A
MapReduce algorithm splits up the data, processes it in parallel, and then sorts, combines,
and aggregates the results back together. However, the MapReduce algorithm is not well
suited for interactive analysis or iterative programs because it writes the data to a disk in
between each computational step. This is expensive when working with large data sets.

As an example, suppose you are the
director of a toy company. Every toy has two colors, and when a client orders a toy from the
web page, the web page puts an order file on Hadoop with the colors of the toy. Your task is
to find out how many units of each color you need to prepare, using a MapReduce-style
algorithm to count the colors.
As the name suggests, the process roughly boils down to two big phases:

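Mapping phase - The documents are split up into key-value pairs. Until we reduce,
we can have many duplicates.
Reduce phase - This is not unlike a SQL "group by." The different unique occurrences
are grouped together, and depending on the reducing function, a different result can
be created. Here we want a count per color, so that is what the reduce function
returns.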
The whole process is described in the following six steps and depicted in figure 5.4.
1. Reading the input files.
2. Passing each line to a mapper job.
3. The mapper job parses the colors (keys) out of the file and outputs a file for each
color with the number of times it has been encountered (value). Or more
technically said, it maps a key (the color) to a value (the number of occurrences).
4. The keys get shuffled and sorted to facilitate the aggregation.
5. The reduce phase sums the number of occurrences per color and outputs one file
per key with the total number of occurrences for each color.
6. The keys are collected in an output file.
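
The six steps can be imitated on a single machine to make the data flow concrete. This is a sketch of the MapReduce pattern in plain Python, not Hadoop code: the order file contents and function names are hypothetical, and in-memory lists stand in for the files Hadoop would write between steps.

```python
from collections import defaultdict

# Hypothetical order files: each line lists the two colors of one ordered toy.
order_files = ["green,blue\nblue,orange", "green,red\nblue,orange"]

def mapper(line):
    """Map step: emit a (color, 1) pair for every color on the line."""
    return [(color.strip(), 1) for color in line.split(",")]

def shuffle_and_sort(pairs):
    """Group the emitted values by key and sort the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reducer(key, values):
    """Reduce step: sum the occurrences for one color."""
    return key, sum(values)

lines = [line for content in order_files for line in content.splitlines()]  # steps 1-2
mapped = [pair for line in lines for pair in mapper(line)]                  # step 3
grouped = shuffle_and_sort(mapped)                                          # step 4
totals = dict(reducer(key, values) for key, values in grouped.items())      # steps 5-6
print(totals)
```

On a cluster the mapper calls run in parallel on different machines, and each reducer receives all values for its keys after the shuffle.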

Spark: replacing MapReduce for better performance

Data scientists often do interactive analysis and rely on algorithms that are inherently
iterative; it can take a while until an algorithm converges to a solution. As this is a weak point
of the MapReduce framework, we introduce the Spark Framework to overcome it. Spark
improves the performance on such tasks by an order of magnitude.

What Is Spark?

Spark is a cluster computing framework similar to MapReduce. Spark, however,
does not handle the storage of files on the (distributed) file system itself, nor does it handle
resource management. For this it relies on systems such as the Hadoop File System, YARN,
or Apache Mesos. Hadoop and Spark are thus complementary systems. For testing and
development, you can even run Spark on your local system.
How Does Spark Solve The Problems Of MapReduce?

While we oversimplify things a bit for the sake of clarity, Spark creates a kind of
shared RAM between the computers of your cluster. This allows the different
workers to share variables (and their state) and thus eliminates the need to write the
intermediate results to disk. More technically, Spark
uses Resilient Distributed Datasets (RDD), which are a distributed memory abstraction that
lets programmers perform in-memory computations on large clusters in a fault tolerant way.
Because it is an in-memory system, it avoids costly disk operations.

The Different Components of the Spark Ecosystem:

Spark core provides a NoSQL environment well suited for interactive, exploratory
analysis. Spark can be run in batch and interactive mode and supports Python.

Spark has four other large components, as listed below and depicted in figure 5.5.

1. Spark streaming is a tool for real-time analysis.
2. Spark SQL provides a SQL interface to work with Spark.
3. MLlib is a tool for machine learning inside the Spark framework.
4. GraphX extends Spark for graph processing.
Introduction to NoSQL

The goal of NoSQL databases is not only to offer a way to partition
databases successfully over multiple nodes, but also to present fundamentally different ways
to model the data at hand to fit its structure to its use case and not to how a relational database
requires it to be modeled.

We start by looking at the core ACID
principles of single-server relational databases and show how NoSQL databases rewrite them
into BASE principles so they work better in a distributed fashion. We also look at the
CAP theorem, which describes the main problem with distributing databases across multiple
nodes and how ACID and BASE databases approach it.

ACID: the core principle of relational databases

Atomicity-The "all or nothing" principle. If a record is put into a database, it is
put in completely or not at all. If, for instance, a power failure occurs in the
middle of a database write action, you
would not end up with half a record.
Consistency-This important principle maintains the integrity of the data. No
entry that makes it into the database will ever be in conflict with predefined rules,
such as lacking a required field or a field being numeric instead of text.
Isolation-When something is changed in the database, nothing can happen on this
exact same data at exactly the same moment. Instead, the actions happen in serial
with other changes. Isolation is a scale going from low isolation to high isolation.
Durability-If data has entered the database, it should survive permanently.
Physical damage to the hard disks will destroy records, but power outages and
software crashes should not.

CAP Theorem: the problem with DBs on many nodes:

Once a database is spread out over different servers, it is difficult to follow the
ACID principle because of the consistency ACID promises. The CAP Theorem states that a
database can be any two of the following things but never all three:
Partition tolerant-The database can handle a network partition or network
failure.
Available-As long as the node you are connecting to is up and running and you
can connect to it, the node will respond, even if the connection between the
different database nodes is lost.
Consistent-No matter which node you connect to, you always see the exact
same data.

For a single-node database it is easy to see how it is always available and consistent:

Available-As long as the node is up, it is available. That is
all the CAP availability promises.
Consistent-There is no second node, so nothing can be inconsistent.

Things get interesting once the database gets partitioned. Then you need
to make a choice between availability and consistency, as shown in
figure 6.2.

The BASE principles of NoSQL databases

Relational databases follow the ACID principles; NoSQL databases that do not follow ACID,
such as the document stores and key-value stores, follow BASE. BASE is a set of much
softer database promises:

Basically available-Availability is guaranteed in the CAP sense. Taking the web
shop example, if a node is up and running, you can keep on shopping. Depending
on how things are set up, nodes can take over from other nodes.
Soft state-The state of a system might change over time. This corresponds to the
eventual consistency principle: the system might have to change to make the data
consistent again.
Eventual consistency-The database will become consistent over time. In the web
shop example, the table is sold twice, which results in data inconsistency. Once
the connection between the individual nodes is reestablished, they
communicate and decide how to resolve it.
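
The web shop scenario can be played through in a toy simulation. Everything in the sketch below is an assumption made for illustration: the Node class, the customer names, and in particular the reconciliation policy (the first order in merge order wins), which real databases resolve in their own ways.

```python
class Node:
    """A hypothetical replica that keeps answering while cut off from its peer."""

    def __init__(self, name, stock):
        self.name = name
        self.stock = stock   # what this node believes is still in stock
        self.orders = []

    def sell(self, customer):
        # Basically available: the node answers from its local state.
        if self.stock > 0:
            self.stock -= 1
            self.orders.append(customer)
            return True
        return False

def reconcile(nodes, real_stock):
    """After the partition heals, merge all orders and flag those that cannot be fulfilled."""
    all_orders = [order for node in nodes for order in node.orders]
    fulfilled, conflicts = all_orders[:real_stock], all_orders[real_stock:]
    remaining = real_stock - len(fulfilled)
    for node in nodes:
        node.stock = remaining   # soft state: the local view changes to become consistent
    return fulfilled, conflicts

# Both nodes believe one table is left when the connection between them is lost.
node_a, node_b = Node("a", stock=1), Node("b", stock=1)
sold_a = node_a.sell("alice")   # accepted during the partition
sold_b = node_b.sell("bob")     # the same table is sold a second time

fulfilled, conflicts = reconcile([node_a, node_b], real_stock=1)
print("fulfilled:", fulfilled, "to resolve:", conflicts)
```

A database that favors consistency over availability would instead refuse one of the two sales while the nodes cannot reach each other.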
NoSQL database types:

There are four big NoSQL types:

1. Key-value store
2. Document store
3. Column-oriented database
4. Graph database

A full-scale relational database can be made up of many entities and linking tables. The
following sections look at the different NoSQL types.

1. Key-Value Stores:

Key-value stores are the least complex of the NoSQL databases. They are, as the
name suggests, a collection of key-value pairs, as shown in figure 6.11, and this simplicity
makes them the most scalable of the NoSQL database types, capable of storing huge
amounts of data.
2. Document Stores

Document stores are one step up in complexity from key-value stores: a document
store does assume a certain document structure that can be specified with a schema.
Document stores are designed to store everyday documents as is, and they allow for
complex querying and calculations on this often already aggregated form of data.
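
To make the idea concrete, the sketch below stores whole articles as nested Python dictionaries and queries them by a nested field. The sample documents, the `find` helper, and its dotted-path syntax are hypothetical; real document stores offer their own query languages.

```python
# Each article is stored as one self-contained document, comments included.
articles = [
    {
        "title": "Intro to NoSQL",
        "author": {"name": "Ada", "country": "UK"},
        "tags": ["nosql", "databases"],
        "comments": [{"by": "Bob", "stars": 5}, {"by": "Cy", "stars": 3}],
    },
    {
        "title": "Graphs in practice",
        "author": {"name": "Lin", "country": "SG"},
        "tags": ["graphs"],
        "comments": [],
    },
]


def find(documents, path, value):
    """Return the documents whose nested field at a dotted path equals value."""
    matches = []
    for doc in documents:
        current = doc
        for part in path.split("."):
            # Walk one level deeper, or give up if the structure ends early.
            current = current.get(part) if isinstance(current, dict) else None
        if current == value:
            matches.append(doc)
    return matches


def average_stars(document):
    """Calculate over data that is already aggregated inside the document."""
    stars = [comment["stars"] for comment in document["comments"]]
    return sum(stars) / len(stars) if stars else None


uk_titles = [doc["title"] for doc in find(articles, "author.country", "UK")]
print(uk_titles, average_stars(articles[0]))
```

Note that no join is needed: the author and the comments travel with the article, which is what "already aggregated" means here.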

3. Column-Oriented Database

Traditional relational databases are row-oriented, with each row having a row id and
each field within the row stored together in a table. Suppose, for example, that no
extra data about hobbies is stored and you have only a single table to describe people, as
shown in figure 6.8. Notice how in this scenario you have slight denormalization because
hobbies could be repeated. If the hobby information is a nice extra but not essential to your
use case, adding it as a list within the Hobbies column is an acceptable approach.

Every time you look up something in a row-oriented database, every row is scanned,
regardless of which columns you need. Suppose you only want a list of birthdays in
September. The database will scan the table from top to bottom and left to right, as shown in
figure 6.9, eventually returning the list of birthdays.

Indexing the data on certain columns can significantly improve lookup speed, but indexing
every column brings extra overhead and the database is still scanning all the columns.
Column databases store each column separately, allowing for quicker scans when only a
small number of columns is involved; see figure 6.10.
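
The difference between the two layouts can be sketched with plain Python structures. The sample people, the ISO date format, and both function names are made up for this illustration; the point is only which data each lookup has to read.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"row_id": 1, "name": "Ann", "birthday": "1985-09-12", "hobbies": ["chess"]},
    {"row_id": 2, "name": "Raj", "birthday": "1990-03-02", "hobbies": ["tennis", "chess"]},
    {"row_id": 3, "name": "Lee", "birthday": "1978-09-30", "hobbies": []},
]

# Column-oriented layout: each column is stored separately as row_id -> value.
columns = {
    field: {row["row_id"]: row[field] for row in rows}
    for field in ("name", "birthday", "hobbies")
}


def september_birthdays_row_store(table):
    # Every full record is read, even though only one field is needed.
    return [record["row_id"] for record in table if record["birthday"][5:7] == "09"]


def september_birthdays_column_store(cols):
    # Only the birthday column is read; name and hobbies are never touched.
    return [row_id for row_id, day in cols["birthday"].items() if day[5:7] == "09"]


print(september_birthdays_row_store(rows))
print(september_birthdays_column_store(columns))
```

Both functions return the same row ids, but the column version touches a third of the stored values in this example, and the gap widens as tables gain more columns.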

4. Graph Databases

The last big NoSQL database type is the most complex one, geared toward storing
relations between entities in an efficient manner. When the data is highly interconnected,
such as for social networks, scientific paper citations, or capital asset clusters, graph
databases are the answer. Graph or network data has two main components:
Node-The entities themselves. In a social network this could be people.
Edge-The relationship between two entities. This relationship is represented by a line
and has its own properties. An edge can have a direction, for example, if the arrow
indicates who is whose boss.

Graphs can become incredibly complex given enough relation and entity types. Figure
6.14 already shows that complexity with only a limited number of entities. Graph
databases like Neo4j also claim to uphold ACID, whereas document stores and key-value
stores adhere to BASE.
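
Nodes and directed edges with properties can be modeled in a few lines. The sketch below is a plain in-memory illustration, not the API of Neo4j or any other graph database; the people, the `boss_of` and `friend_of` relation names, and the helper functions are hypothetical, and the traversal assumes the boss relationships contain no cycles.

```python
from collections import namedtuple

# An edge has a direction (source -> target), a relation type, and its own properties.
Edge = namedtuple("Edge", ["source", "target", "relation", "properties"])

nodes = {
    "ann": {"type": "person"},
    "raj": {"type": "person"},
    "lee": {"type": "person"},
}

edges = [
    Edge("ann", "raj", "boss_of", {"since": 2019}),
    Edge("raj", "lee", "boss_of", {"since": 2021}),
    Edge("ann", "lee", "friend_of", {}),
]


def direct_reports(person):
    """Follow outgoing boss_of edges one step."""
    return [e.target for e in edges if e.source == person and e.relation == "boss_of"]


def chain_of_command(person):
    """Follow boss_of edges backwards until nobody is above (assumes no cycles)."""
    chain = []
    current = person
    while True:
        bosses = [e.source for e in edges
                  if e.target == current and e.relation == "boss_of"]
        if not bosses:
            return chain
        current = bosses[0]
        chain.append(current)


print(direct_reports("ann"))
print(chain_of_command("lee"))
```

The direction of each edge is what makes "who is whose boss" answerable: reversing an edge would reverse the reporting line.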
QUESTION BANK:
UNIT I
1. Data is mainly used to:
a) Confuse people b) Make better decisions
c) Increase mistakes d) Reduce accuracy
Answer: b
2. Raw data is best described as:
a) Organized facts b) Unprocessed information
c) Final results d) Cleaned data
Answer: b
3. Among the following, the step that comes first in the data science process is:
a) Modeling b) Deployment
c) Data collection d) Evaluation
Answer: c
4. Big data mainly deals with:
a) Small data b) Large and complex datasets
c) Only numbers d) Only images
Answer: b
5. The big data ecosystem is:
a) Only software b) A set of tools and technologies working together
c) Only hardware d) A biological system
Answer: b
6. Which of the following is not a type of data?
a) Structured b) Semi-structured
c) Unstructured d) Invisible data
Answer: d
7. The step where insights and patterns are found is:
a) Data cleaning b) Data analysis
c) Deployment d) Data deletion
Answer: b
8. A popular big data technology is:
a) MS Word b) Hadoop
c) Calculator d) Notepad
Answer: b
9. Data science mainly combines:
a) Painting and music b) Statistics, programming, domain knowledge
c) Dancing and singing d) Only mathematics
Answer: b
10. Processed data is known as:
a) Information b) Raw data
c) Noise d) Files
Answer: a

5-MARK QUESTIONS
1. Explain any five benefits and uses of data in organizations.
2. What are the different types of data? Explain with examples.
3. Write a short note on the data science process.

5. Describe the difference between data and information.
6. Explain any five tools used in the big data ecosystem.
7. What is raw data? Explain its characteristics.
8. Describe the main steps involved in data science with one-line explanations.
10-MARK QUESTIONS
1. Explain the data science process in detail with a neat diagram. Discuss each step clearly.
2. Describe the big data ecosystem in detail. Explain major components such as Hadoop,
Spark, NoSQL, and visualization tools.
3. Discuss the benefits and uses of data with examples from different fields such as
healthcare, business, education, and banking.
4. Explain the different types of data (structured, semi-structured, unstructured) and how they
are used in data science.
5. Write a detailed note on the role of big data and data science in modern industries. Provide
examples.
6. Compare traditional data processing with big data processing. Explain why big data
technologies are needed.
7. Explain the complete workflow from collecting data to deploying a data science model.
8. Discuss in detail the major tools and frameworks used in data science and big data
environments.
UNIT II
1. The first step in the data science process is:
a) Model building b) Overview / understanding the problem
c) Deployment d) Testing
Answer: b
2. Research goals help to:
a) Make the project confusing b) Give direction and purpose to the project
c) Slow down the process d) Increase errors
Answer: b

3. Retrieving data means:
a) Deleting data b) Collecting or getting data from different sources
c) Packaging data d) Drawing graphs
Answer: b
4. Data transformation includes:
a) Changing data into a suitable format
b) Destroying data
c) Hiding errors
d) Printing results
Answer: a
5. Exploratory Data Analysis (EDA) is done to:
a) Ignore patterns b) Understand patterns, trends, and relationships in data
c) Remove all data d) Increase missing values
Answer: b
6. Model building is mainly used to:
a) Predict outputs or classify data b) Paint pictures
c) Remove important columns d) Play games
Answer: a
7. A common technique used in EDA is:
a) Cooking b) Statistical summaries and visualizations
c) Making phone calls d) Adding random numbers
Answer: b
8. During model building, training means:
a) Teaching the model using known data
b) Removing rows
c) Creating folders
d) Copying files
Answer: a
9. Which of the following is used for retrieving data?
a) Databases b) Video games
c) Music players d) Stickers
Answer: a
10. Data transformation helps in:
a) Improving quality and consistency of data
b) Making data unreadable
c) Increasing errors
d) Mixing random values
Answer: a
5-MARK QUESTIONS
1. Explain the importance of defining research goals in a data science project.
2. What is data retrieval? Explain any three sources of retrieving data.
3. Write a short note on data transformation with examples.
4. What is Exploratory Data Analysis (EDA)? Explain its main purpose.
5. Describe the main steps involved in model building.
6. Explain the role of overview and understanding the problem in data science.
7. Differentiate between retrieving data and transforming data.
8. Explain any five techniques used in Exploratory Data Analysis.
10-MARK QUESTIONS
1. Describe the complete data science process starting from overview to model building.
Provide a neat diagram.
2. Explain in detail the importance of research goals. Discuss how they guide each step of the
data science process.
3. Discuss data retrieval in detail. Explain various internal and external data sources with
examples.
4. Explain the process of data transformation. Describe cleaning, integration, normalization,
and feature engineering.
5. What is Exploratory Data Analysis? Explain its techniques, tools, and role in discovering
patterns in data with examples.
6. Write a detailed note on model building. Explain training, testing, validation, model
selection, and evaluation.
7. Compare and contrast the steps: retrieving data, transforming data, and performing EDA.
Explain how each step prepares data for modeling.
8. Discuss how a clear overview and well-defined research goals influence the success of data
science projects. Provide real-world examples.

UNIT III
1. Machine learning algorithms help systems to:
a) Sleep b) Learn from data
c) Draw cartoons d) Delete files
Answer: b
2. Supervised learning uses:
a) No data b) Only images
c) Labeled data d) Only videos
Answer: c
3. Unsupervised learning deals with:
a) Labeled data b) Unlabeled data
c) Audio files only d) No patterns
Answer: b
4. Semi-supervised learning uses:
a) Only unlabeled data
b) Only text
c) A small amount of labeled data + large unlabeled data
d) No data at all
Answer: c
5. The first step in the modeling process is:
a) Testing
b) Understanding the problem
c) Deployment
d) Deleting data
Answer: b
6. An example of a supervised learning algorithm is:
a) K-Means b) Linear Regression
c) Apriori d) PCA
Answer: b
7. An example of an unsupervised learning algorithm is:
a) Decision Tree b) Logistic Regression
c) K-Means Clustering d) Naive Bayes
Answer: c
8. Semi-supervised learning is useful when:
a) Labeled data is expensive b) Labeled data is unlimited
c) No data is available d) Models are not needed
Answer: a
9. The modeling process includes:
a) Training, testing, evaluation
b) Playing games
c) Sending emails
d) Copying folders
Answer: a
10. Supervised learning problems are mainly of two types:
a) Noise and error b) Classification and regression
c) Groups and clusters d) None
Answer: b
5-MARK QUESTIONS
1. Explain supervised learning with examples.
2. What is unsupervised learning? Describe any two common algorithms.
3. Write a short note on semi-supervised learning.
4. Explain the major steps in the modeling process.
5. List any five machine learning algorithms and explain their purpose.
6. Differentiate between supervised and unsupervised learning.
7. Explain the importance of labeled and unlabeled data in ML types.
8. Describe the role of training and testing in model building.
10-MARK QUESTIONS
1. Explain in detail the three types of machine learning: supervised, unsupervised, and semi-
supervised. Provide examples and applications for each.
2. Discuss the modeling process in detail. Explain data preparation, model selection, training,
validation, testing, and evaluation.
3. Explain supervised learning. Discuss classification and regression algorithms with real-
time examples.
4. Write a detailed note on unsupervised learning. Explain clustering, dimensionality
reduction, and association rules.
5. Discuss semi-supervised learning. Explain how it combines features of supervised and
unsupervised learning with examples.
6. Compare supervised, unsupervised, and semi-supervised learning. Explain similarities,
differences, advantages, and limitations.
7. Explain machine learning algorithms in detail. Describe at least five algorithms (e.g.,
Linear Regression, Decision Tree, KNN, K-Means, Naive Bayes).
8. Describe how the modeling process helps in building, validating, and improving machine
learning models. Provide a neat diagram.

UNIT IV

1. Hadoop mainly deals with:
a) Small data b) Big data
c) Only text data d) No data
Answer: b
2. The storage system used in Hadoop is:
a) NTFS b) HDFS
c) FAT32 d) APFS
Answer: b
3. Spark is faster than MapReduce because it uses:
a) Disk processing b) In-memory processing
c) CD-ROM d) Cloud only
Answer: b
4. The main reason Spark replaces MapReduce is:
a) It is slower b) It supports faster, near-real-time processing
c) It has fewer features d) It cannot handle big data
Answer: b
5. NoSQL databases are used for:
a) Structured data only b) Unstructured and semi-structured data
c) Only images d) Only tables
Answer: b
6. ACID properties ensure:
a) Low performance b) Reliable and consistent transactions
c) Random processing d) Temporary data
Answer: b
7. In the CAP theorem, "A" stands for:
a) Availability b) Accuracy
c) Action d) Automation
Answer: a
8. BASE stands for:
a) Basic Application Software Engine
b) Basically Available, Soft state, Eventual consistency
c) Binary Accessible System Entry
d) Basic Array Storage Engine
Answer: b
9. A type of NoSQL database is:
a) Relational database b) Key-value store
c) Spreadsheet d) SQL dump
Answer: b
10. Hadoop ecosystem includes:
a) Word and Excel b) HDFS, MapReduce, YARN
c) Gmail only d) Operating systems
Answer: b
5-MARK QUESTIONS

1. Explain the architecture and components of the Hadoop framework.
2. Write a short note on Apache Spark and its advantages.
3. Why is Spark considered a replacement for MapReduce? Explain.
4. What is NoSQL? Describe any two NoSQL database types.
5. Explain ACID properties with simple examples.
6. What is the CAP theorem? Describe each of its components.
7. Explain BASE properties and how they differ from ACID.
8. Describe the main types of NoSQL databases: Key-Value, Document, Column, Graph.

10-MARK QUESTIONS

1. Explain the Hadoop framework in detail. Discuss HDFS, YARN, and MapReduce with a
neat diagram.
2. Discuss Apache Spark in detail. Explain its architecture, features, advantages, and why it is
faster than MapReduce.
3. Compare MapReduce and Spark. Explain how Spark replaces MapReduce along with real-
world examples.
4. Explain NoSQL databases in detail. Discuss characteristics, benefits, and all major types
with examples.
5. Describe ACID properties in detail. Explain how they ensure reliable database transactions
with real-life examples.
6. Explain the CAP theorem in depth. Discuss trade-offs between consistency, availability,
and partition tolerance in distributed systems.
7. Discuss BASE properties in detail. Compare ACID vs BASE and explain their application
in modern databases.
8. Write a detailed note on NoSQL database types: Key-value store, Document store, Wide-
column store, Graph store. Add examples and use cases.
UNIT V
1. What is the first step in a disease prediction case study?
a) Model evaluation b) Setting research goals
c) Data visualization d) Deployment
Answer: b
2. Data retrieval mainly deals with:
a) Cleaning data b) Collecting data from various sources
c) Visualizing results d) Deploying a model
Answer: b
3. Which of the following is part of data preparation?
a) Identifying symptoms b) Handling missing values
c) Deploying the model d) Writing conclusions
Answer: b
4. Exploratory Data Analysis (EDA) helps in:
a) Removing algorithms b) Understanding patterns in data
c) Encrypting data d) Creating user interfaces
Answer: b
5. Disease profiling means:
a) Visualizing patient names b) Identifying disease patterns & features
c) Predicting salaries d) Backing up the database
Answer: b
6. Automation in disease prediction refers to:
a) Making manual reports b) Automating prediction tasks
c) Removing data sources d) Storing paper records
Answer: b
7. Which data is commonly used in disease prediction?
a) Weather data b) Medical & clinical data
c) Vehicle data d) Movie rating data
Answer: b
8. In EDA, correlation helps identify:
a) Missing files b) Relationships between variables
c) User passwords d) Hardware issues
Answer: b
9. A predictive model for diseases generally outputs:
a) A picture b) A prediction (Yes/No or category)
c) A poem d) A video
Answer: b
10. Presenting results in disease prediction often includes:
a) Charts and reports b) Games
c) Audio recordings d) None
Answer: a
5-MARK QUESTIONS
1. Explain the process of setting research goals for a disease prediction project.
2. Describe the methods used for disease-related data retrieval.
3. What are the key steps in data preparation for disease prediction?
4. Write a short note on Exploratory Data Analysis (EDA) in disease prediction.
5. Explain disease profiling with suitable examples.
6. What is the role of data cleaning in effective disease prediction?
7. Describe the importance of visualization in predicting diseases.
8. Mention any five challenges in data retrieval and preparation for medical datasets.
9. Explain the need for automation in disease prediction systems.
10. Explain the significance of feature selection in disease prediction.

10-MARK QUESTIONS
1. Explain the complete workflow of a disease prediction case study from setting research
goals to final presentation.
2. Discuss in detail the process of data retrieval, preparation, and exploration for disease
prediction with examples.
3. Describe how EDA helps in understanding patterns and symptoms in disease datasets.
Give suitable charts or examples.
4. Explain disease profiling in detail. How does profiling help in diagnosing and predicting
specific diseases?
5. Write a detailed note on building predictive models for diseases. Explain data preparation,
algorithm selection, training, and evaluation.
6. Discuss how automation improves disease prediction systems. Explain tools, workflows,
and real-life examples.
7. Compare manual disease analysis with automated disease prediction systems. Explain
benefits and limitations.
8. Explain the challenges of working with medical data in disease prediction: privacy,
missing data, noise, imbalance, etc.

--------------------End--------------------
