CS3352 Foundations of Data Science Syllabus
Year/Semester/Section : II/III/A
CSE 2024-25
Course Code & Course Name : CS3352 & Foundations of Data Science
Name of the Faculty : [Link]
Year/Sem/Sec : II/III/A
Available
S. No. Content
(Yes/No)
1. Syllabus Copy
2. Lesson Plan
3. Preamble
4. Subject Timetable
5. Minutes of the Course Committee Meeting (2 Meetings/Semester)
6. Course Material (Lecture Handouts)
8. Question Bank
10. CIA II – Question Paper, Answer Key & Sample Papers (3 Nos)
11. CIA III – Question Paper, Answer Key & Sample Papers (3 Nos)
12. Model Exam- Question Paper, Answer Key & Sample Papers (3 Nos)
13. MKC Material
Faculty Signature :
Course Code & Course Name : CS3352 & Foundations of Data Science L T P C
Course Objectives
Course Outcomes
Define the data science process
Understand different types of data description for data science process
UNIT I INTRODUCTION 9
Data Science: Benefits and uses – facets of data - Data Science Process: Overview – Defining research
goals – Retrieving data – Data preparation - Exploratory Data analysis – build the model– presenting
findings and building applications - Data Mining - Data Warehousing – Basic Statistical descriptions
of Data
UNIT II DESCRIBING DATA 9
Types of Data - Types of Variables -Describing Data with Tables and Graphs –Describing data with
Averages - Describing Variability - Normal Distributions and Standard (z) Scores
Total : 45
Text Books:
S.No. | Author(s) | Title of the Book | Publisher | Year of Publication
1. | David Cielen, Arno D. B. Meysman | Introducing Data Science | Manning Publications | 2016
2. | Robert S. Witte and John S. Witte | Statistics | Wiley Publications | 2017
3. | Jake VanderPlas | Python Data Science Handbook | O’Reilly | 2016
Reference Books:
S.No. | Author(s) | Title of the Book | Publisher | Year of Publication
1. | Allen B. Downey | Think Stats: Exploratory Data Analysis in Python | Green Tea Press | 2014
41 Colors – Subplots A
Actual Date of
Proposed Mode of
Topics Covered Lecture
[Link] Delivery
Date Period Date Period
45 Visualization With A
Seaborn
Content Beyond the Syllabus
Mode of Delivery:
K:Technical Debate
CSE 2024-25
Course Code & Course Name : CS3352 & Foundations of Data Science
Name of the Faculty : [Link]
Year/Sem/Sec : II / III / A Date :
Prerequisite knowledge
for complete learning of Requires in depth knowledge of data science and python libraries.
subject
Commercial applications
Human Resource management
Benefit / Application of Financial applications
this Subject
Fraud detection
Customer analysis
VIII
Topics Mapping for
Future Semester Subjects Project Work
David Cielen, Arno D. B. Meysman, and Mohamed Ali,
“Introducing Data Science”, Manning Publications, 2016.
(Unit I)
Robert S. Witte and John S. Witte, “Statistics”, Eleventh
Important Books / Edition, Wiley Publications, 2017. (Units II and III)
Journals for Learning Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016.
(Units IV and V)
Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”,
Green Tea Press,2014
Course Faculty HoD
Year/Sem/Sec : II/III/A
Day/ 1 2 - 3 4 - 5 6 7
Hour/
09.10- 10.05- 11.00- 11.15- 12.10- 01.00- 01.45- 02.40- 03.35-
Time
10.05 11.00 11.15 12.10 01.00 01.45 02.40 03.35 04.30
Monday DM DS OOPS FDS DS LAB
Lunch Break
Wednesday OOPS DS FDS DM OOPS LAB
Thursday NM NM NM
Course
Course Name Abbreviation Faculty Name
Code
MA3354 Discrete Mathematics DM [Link]
Digital principles and computer [Link] kumar &
CS3351 DPCO
organization [Link]
CS3352 Foundations of data science FDS [Link]
CS3301 Data structures DS [Link]
CS3391 Object oriented programming OOPS [Link]
CS3311 Data structure laboratory DS LAB [Link]
Object oriented programming
CS3381 OOPS LAB [Link]
laboratory
CS3361 Data science laboratory FDS LAB [Link]
Class Advisor [Link]
Exam Coordinator [Link]
Placement Coordinator [Link]
Course Code & Course Name : CS3352 & Foundations of Data Science
[Link]. Course Faculty Name Branch Year/ Sem Section Faculty Signature
1. [Link]
CSE II/III A
2. [Link]
Points Discussed:
Planned to give homework problems in difficult topics
Discussed about portion completion
To check and correct students handwritten notes periodically.
Ask to give question bank
Course Code & Course Name : CS3352 & Foundations of Data Science
[Link]. Course Faculty Name Branch Year/ Sem Section Faculty Signature
1. [Link]
CSE II/III A
2. [Link]
Points Discussed:
To make the students concentrate more on previous years' Anna University questions.
To make the students understand the subject knowledge
Discussed about CIA I students feedback and performance
Identify the weak students and give home test and assessment
CSE II/III
Data Science :
Data Science along with artificial intelligence (AI) and its various components
such as statistical learning (SL), machine learning (ML) and deep learning
algorithms (DL) are recognized as main drivers of organizational value
creation. According to Dr Jim Gray, Data Science is the fourth paradigm which
drives innovative solutions to organizational problems.
This course is suitable for students/practitioners interested in improving their
knowledge in the fundamental concepts of Data Science. The course will also
prepare the learner for a career in the field of Data Analytics.
Prerequisite knowledge for Complete understanding and learning of Topic:
Python
Discrete mathematics
Statistics
Detailed content of the Lecture:
Introduction:
Data
In computing, data is information that has been translated into a form that is
efficient for movement or processing
Data Science
Data science is an evolutionary extension of statistics capable of dealing with
the massive amounts of data produced today. It adds methods from computer
science to the repertoire of statistics.
Benefits and uses of data science
Data science and big data are used almost everywhere in both commercial and noncommercial
settings.
Commercial companies in almost every industry use data science and big data
to gain insights into their customers, processes, staff, competition, and
products.
Many companies use data science to offer customers a better user experience,
as well as to cross-sell, up-sell, and personalize their offerings.
Governmental organizations are also aware of data’s value. Many
governmental organizations not only rely on internal data scientists to discover
valuable information, but also share their data with the public.
Nongovernmental organizations (NGOs) use it to raise money and defend their
causes.
Universities use data science in their research but also to enhance the study
experience of their students. The rise of massive open online courses (MOOC)
produces a lot of data, which allows universities to study how this type of
learning can complement traditional classes.
Data Science Process
Defining research goals
A project starts by understanding the what, the why, and the how
of your project. The outcome should be a clear research goal, a good
understanding of the context, well-defined deliverables, and a plan of
action with a timetable. This information is then best placed in a project
charter.
Course Faculty
Verified by HoD
LECTURE HANDOUTS L2
CSE II/III
Python
Discrete mathematics
Statistics
Detailed content of the Lecture:
Facets of Data:
Structured data
Unstructured data
Unstructured data is data that isn’t easy to fit into a data model because
the content is context-specific or varying. One example of unstructured data
Natural language
“Graph data” can be a confusing term because any data can be shown in
a graph.
Graph or network data is, in short, data that focuses on the relationship
or adjacency of objects.
The graph structures use nodes, edges, and properties to represent and
store graphical data.
Graph-based data is a natural way to represent social networks,
and its structure allows you to calculate specific metrics such as
the influence of a person and the shortest path between two people.
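A minimal sketch of the two metrics mentioned above, assuming a small made-up friendship network. The names, the `friends` dictionary, and the `shortest_path` helper are illustrative only; the number of direct connections is used here as one simple stand-in for influence.

```python
from collections import deque

# Hypothetical friendship network: each key is a person (node) and each
# list holds the people they are directly connected to (edges).
friends = {
    "Ann": ["Bob", "Cara"],
    "Bob": ["Ann", "Dev"],
    "Cara": ["Ann", "Dev"],
    "Dev": ["Bob", "Cara", "Eli"],
    "Eli": ["Dev"],
}

def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal using breadth-first search."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        person = path[-1]
        if person == goal:
            return path
        for neighbour in graph[person]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # the two people are not connected

path = shortest_path(friends, "Ann", "Eli")

# Degree (number of direct connections) as a simple proxy for influence.
degree = {person: len(links) for person, links in friends.items()}
most_connected = max(degree, key=degree.get)
print(path, most_connected)
```

Breadth-first search explores the network one step at a time, so the first path that reaches the goal is guaranteed to use the fewest edges.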
Audio, image, and video
Audio, image, and video are data types that pose specific challenges to a
data scientist.
Tasks that are trivial for humans, such as recognizing objects in
pictures, turn out to be challenging for computers.
MLBAM (Major League Baseball Advanced Media) announced in
2014 that they’ll increase video capture to approximately 7 TB per
game for the purpose of live, in-game analytics.
Recently a company called DeepMind succeeded at creating an
algorithm that’s capable of learning how to play video games.
This algorithm takes the video screen as input and learns to
interpret everything via a complex process of deep learning.
Important Books/Journals for further learning including the page nos.:
“Data Science from Scratch”, author: Joel Grus, page nos. 403
“Essential Math for Data Science”, author: Thomas Nield, page nos. 347
“Becoming a Data Head”, author: Alex Gutman, page nos. 272
Course Faculty
Verified by HoD
LECTURE HANDOUTS L3
CSE II/III
Python
Discrete mathematics
Statistics
Data Science Process:
4. The fourth step is data exploration. The goal of this step is to gain a
deep understanding of the data. You’ll look for patterns,
correlations, and deviations based on visual and descriptive
techniques. The insights you gain from this phase will enable you to
start modeling.
5. Finally, we get to model building (often referred to as “data
modeling” throughout this book). It is now that you attempt to gain
the insights or make the predictions stated in your project charter.
Now is the time to bring out the heavy guns, but remember research
has taught us that often (but not always) a combination of simple
models tends to outperform one complicated model. If you’ve done
this phase right, you’re almost done.
6. The last step of the data science model is presenting your results and
automating the analysis, if needed. One goal of a project is to change a
process and/or make better decisions. You may still need to convince the
business that your findings will indeed change the business process as
expected. This is where you can shine in your influencer role. The
importance of this step is more apparent in projects on a strategic and
tactical level. Certain projects require you to perform the business
process over and over again, so automating the project will save time.
Course Faculty
Verified by HoD
LECTURE HANDOUTS L4
CSE II/III
Python
Discrete mathematics
Statistics
Detailed content of the Lecture:
Retrieving data
The next step in data science is to retrieve the required data.
Sometimes you need to go into the field and design a data
collection process yourself, but most of the time you won’t be
involved in this step.
Many companies will have already collected and stored the data for
you, and what they don’t have can often be bought from third
parties.
More and more organizations are making even high-quality data
freely available for public and commercial use.
Data can be stored in many forms, ranging from simple text files to
tables in a database. The objective now is acquiring all the data you
need.
Most companies have a program for maintaining key data, so much
of the cleaning work may already be done. This data can be stored
in official data repositories such as databases, data marts, data
warehouses, and data lakes maintained by a team of IT
professionals.
Data warehouses and data marts are home to preprocessed data, whereas
data lakes contain data in its natural or raw format.
External Data
If data isn’t available inside your organization, look outside your
organization. Companies provide data so that you, in turn, can
enrich their services and ecosystem. Such is the case with
Twitter, LinkedIn, and Facebook.
More and more governments and organizations share their data for free
with the world.
A list of open data providers that should get you started.
[Link]
[Link]
[Link]
[Link]
Course Faculty
Verified by HoD
LECTURE HANDOUTS L5
CSE II/III
Prerequisite knowledge for Complete understanding and learning of Topic:
Python
Discrete mathematics
Statistics
Introduction to Data Preparation:
Your model needs the data in a specific format, so data transformation will
always come into play. It’s a good habit to correct data errors as early on in
the process as possible. However, this isn’t always possible in a realistic
setting, so you’ll need to take corrective actions in your program.
Cleansing data
Data cleansing is a sub process of the data science process that focuses on
removing errors in your data so your data becomes a true and consistent
representation of the processes it originates from.
The first type is the interpretation error, such as when you take the
value in your data for granted, like saying that a person’s age is greater
than 300 years.
The second type of error points to inconsistencies between data
sources or against your company’s standardized values.
An example of this class of errors is putting “Female” in one table and “F”
in another when they represent the same thing: that the person is female.
Redundant Whitespace
Whitespaces tend to be hard to detect but cause errors like other redundant
characters would.
Whitespace causes mismatches between strings such as “FR ” and
“FR”, so observations that cannot be matched are dropped.
If you know to watch out for them, fixing redundant whitespaces is
luckily easy enough in most programming languages. They all provide
string functions that will remove the leading and trailing whitespaces.
For instance, in Python you can use the strip() function to remove
leading and trailing spaces.
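A short illustration of the whitespace problem and its fix, assuming made-up country codes. The variable names are hypothetical.

```python
# Hypothetical country codes from two sources; one has a trailing space.
sales_country = "FR "
lookup_country = "FR"

# Without cleaning, the keys do not match and the observation would be dropped.
matches_before = sales_country == lookup_country

# strip() removes leading and trailing whitespace, so the keys match again.
matches_after = sales_country.strip() == lookup_country

# Clean a whole column of keys before matching.
raw_codes = [" FR", "DE ", "  NL  "]
clean_codes = [code.strip() for code in raw_codes]
print(matches_before, matches_after, clean_codes)
```

Stripping the keys once, before any comparison or join, is usually simpler than handling the mismatch later.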
Outliers
An outlier is an observation that seems to be distant from other observations
or, more specifically, one observation that follows a different logic or
generative process than the other observations. The easiest way to find
outliers is to use a plot or a table with the minimum and maximum values.
The plot on the top shows no outliers, whereas the plot on the bottom shows
possible outliers on the upper side when a normal distribution is expected.
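A minimal sketch of the minimum/maximum check, assuming a made-up list of ages. The interquartile-range rule that follows is a common rule of thumb added here as an assumption; it is not prescribed by this handout.

```python
import statistics

# Hypothetical ages; 350 is a data-entry error.
ages = [23, 31, 27, 45, 38, 29, 350, 41, 33, 26]

# The simplest check: look at the extremes.
print("min:", min(ages), "max:", max(ages))

# Rule of thumb (an assumption, not the handout's method): flag values
# more than 1.5 * IQR below the first or above the third quartile.
q1, _, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [age for age in ages if age < low or age > high]
print(outliers)
```

A flagged value is only a candidate: whether it is an error or a genuine extreme observation still needs a human judgment.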
Joining Tables
Joining tables allows you to combine the information of one observation
found in one table with the information that you find in another table.
The focus is on enriching a single observation.
Let’s say that the first table contains information about the purchases
of a customer and the other table contains information about the region
where your customer lives.
Joining the tables allows you to combine the information so that you
can use it for your model, as shown in figure.
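The purchase and region example can be sketched with plain dictionaries. The table contents and the `customer_id` key are made up for illustration.

```python
# Hypothetical tables: purchases per customer and the region each customer lives in.
purchases = [
    {"customer_id": 1, "item": "laptop", "amount": 900},
    {"customer_id": 2, "item": "phone", "amount": 500},
    {"customer_id": 1, "item": "mouse", "amount": 20},
]
regions = [
    {"customer_id": 1, "region": "North"},
    {"customer_id": 2, "region": "South"},
]

# Index the region table by its key, then enrich every purchase row with it.
region_by_customer = {row["customer_id"]: row["region"] for row in regions}
joined = [
    {**purchase, "region": region_by_customer.get(purchase["customer_id"])}
    for purchase in purchases
]

for row in joined:
    print(row)
```

Each purchase keeps its own columns and gains a region column, which is the enrichment of a single observation described above. With pandas, `DataFrame.merge` performs the same kind of key-based join on data frames.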
Transforming data
Relationships between an input variable and an output variable aren’t always linear.
Take, for instance, a relationship of the form $y = ae^{bx}$. Taking the logarithm of
both sides gives $\ln y = \ln a + bx$, which is linear in x, so a log transformation
simplifies the estimation problem dramatically. Other times you might want to combine
two variables into a new variable.
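A small sketch of why a log transformation helps for a relationship of the form y = a * e^(b * x): after taking the logarithm of y, an ordinary straight-line fit recovers a and b. The data are synthetic, generated with a = 2 and b = 0.5 purely for illustration.

```python
import math

# Synthetic data generated from y = a * e^(b * x) with a = 2 and b = 0.5.
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * math.exp(0.5 * x) for x in xs]

# Transform: ln(y) = ln(a) + b * x is a straight line in x.
log_ys = [math.log(y) for y in ys]

# Ordinary least squares for a single input variable.
n = len(xs)
mean_x = sum(xs) / n
mean_log_y = sum(log_ys) / n
b = sum((x - mean_x) * (ly - mean_log_y) for x, ly in zip(xs, log_ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = math.exp(mean_log_y - b * mean_x)
print(round(a, 3), round(b, 3))
```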
Dummy variables can only take two values: true (1) or false (0). They’re
used to indicate the presence or absence of a categorical effect that may explain
the observation.
In this case you’ll make separate columns for the classes stored in one
variable and indicate it with 1 if the class is present and 0 otherwise.
An example is turning one column named Weekdays into the columns
Monday through Sunday. You use an indicator to show if the
observation was on a Monday; you put 1 on Monday and 0 elsewhere.
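The Weekdays example can be sketched as follows, assuming four made-up observations. Each observation becomes a row with one 0/1 column per weekday.

```python
# Hypothetical observations with a single Weekdays column.
weekdays = ["Monday", "Wednesday", "Monday", "Sunday"]
classes = ["Monday", "Tuesday", "Wednesday", "Thursday",
           "Friday", "Saturday", "Sunday"]

# One column per class: 1 if the observation belongs to it, 0 otherwise.
dummies = [
    {day: int(observed == day) for day in classes}
    for observed in weekdays
]
print(dummies[0])
```

Every row contains exactly one 1, because each observation falls on exactly one weekday.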
Course Faculty
Verified by HoD
LECTURE HANDOUTS L6
CSE II/III
Python
Discrete mathematics
Statistics
Detailed content of the Lecture:
Exploratory data analysis
During exploratory data analysis you take a deep dive into the data (see figure
below). Information becomes much easier to grasp when shown in a picture,
therefore you mainly use graphical techniques to gain an understanding of your data
and the interactions between variables.
The goal isn’t to cleanse the data, but it’s common that you’ll still discover
anomalies you missed before, forcing you to take a step back and fix them.
EDA is not identical to statistical graphics although the two terms are used
almost interchangeably. Statistical graphics is a collection of techniques--all
graphically based and all focusing on one data characterization aspect. EDA
encompasses a larger venue; EDA is an approach to data analysis that
postpones the usual assumptions about what kind of model the data follow with
the more direct approach of allowing the data itself to reveal its underlying
structure and model. EDA is not a mere collection of techniques; EDA is a
philosophy as to how we dissect a data set; what we look for; how we look; and
how we interpret. It is true that EDA heavily uses the collection of techniques
that we call "statistical graphics", but it is not identical to statistical graphics.
Course Faculty
Verified by HoD
LECTURE HANDOUTS L7
CSE II/III
Topic of Lecture: Build the model – Presenting findings and building Applications
Introduction :
The principle here is simple: the model should work on unseen data. You
use only a fraction of your data to estimate the model and the other part, the
holdout sample, is kept out of the equation.
Python
Discrete mathematics
Statistics
Model Building:
Building a model is an iterative process. The way you build your model
depends on whether you go with classic statistics or the somewhat more
recent machine learning school, and the type of technique you want to
use. Either way, most models consist of the following main steps:
Selection of a modeling technique and variables to enter in the model
Execution of the model
Diagnosis and model comparison
Model and variable selection
You’ll need to select the variables you want to include in your model and
a modeling technique. You’ll need to consider model performance and
whether your project meets all the requirements to use your model, as
well as other factors:
Must the model be moved to a production environment and, if so,
would it be easy to implement?
How difficult is the maintenance on the model: how long will it remain
relevant if left untouched?
Does the model need to be easy to explain?
Model execution
Once you’ve chosen a model you’ll need to implement it in code.
Most programming languages, such as Python, already have
libraries such as StatsModels or Scikit-learn. These packages implement
several of the most popular techniques.
Coding a model is a nontrivial task in most cases, so having these
libraries available can speed up the process. As you can see in the
following code, it’s fairly easy to use linear regression with
StatsModels or Scikit-learn
Doing this yourself would require much more effort even for the
simple techniques. The following listing shows the execution of a
linear prediction model.
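A minimal sketch of executing a linear prediction model, written with only the standard library so that it runs without extra packages. It is a stand-in, not the StatsModels or Scikit-learn code; those libraries replace the hand-written `fit_line` helper with a single call. The data and the helper name are made up for illustration.

```python
# Made-up data: hours studied (input) and exam score (target).
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 73]

def fit_line(xs, ys):
    """Least-squares slope and intercept for one input variable."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(hours, scores)
predicted = intercept + slope * 7  # prediction for an unseen input
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

Even this one-variable case needs several lines of arithmetic, which is why ready-made libraries speed up the process.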
Mean squared error is a simple measure: check for every prediction how
far it was from the truth, square this error, and average the squared errors
over all predictions.
To estimate the models, we use 800 randomly chosen
observations out of 1,000 (or 80%), without showing the other
20% of the data to the model.
Once the model is trained, we predict the values for the other 20%
of the variables based on those for which we already know the true
value, and calculate the model error with an error measure.
Then we choose the model with the lowest error. In this example
we chose model 1 because it has the lowest total error.
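The holdout comparison described above can be sketched as follows. The data are synthetic (roughly y = 3x + 5 plus noise), and the two candidate models, a fitted line and a constant mean prediction, are assumptions chosen only to make the comparison concrete.

```python
import random

random.seed(0)  # fixed seed so the split is reproducible

# Synthetic data: 1,000 observations of roughly y = 3x + 5 plus noise.
data = []
for _ in range(1000):
    x = random.uniform(0, 10)
    data.append((x, 3 * x + 5 + random.gauss(0, 1)))

# Keep 800 random observations for training and hold out the other 200.
random.shuffle(data)
train, holdout = data[:800], data[800:]

def fit_line(rows):
    """Least-squares slope and intercept for (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in rows) / sum(
        (x - mean_x) ** 2 for x, _ in rows
    )
    return slope, mean_y - slope * mean_x

def mse(predict, rows):
    """Mean squared error of a prediction function on (x, y) pairs."""
    return sum((y - predict(x)) ** 2 for x, y in rows) / len(rows)

# Model 1: least-squares line. Model 2: always predict the training mean.
slope, intercept = fit_line(train)
mean_y = sum(y for _, y in train) / len(train)

errors = {
    "model 1 (line)": mse(lambda x: intercept + slope * x, holdout),
    "model 2 (mean)": mse(lambda x: mean_y, holdout),
}
best = min(errors, key=errors.get)
print(errors, best)
```

Both models are estimated on the training rows only; the error is measured on rows neither model has seen, which is what makes the comparison fair.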
Sometimes people get so excited about your work that you’ll need
to repeat it over and over again because they value the
predictions of your models or the insights that you produced.
This doesn’t always mean that you have to redo all of your analysis
all the time. Sometimes it’s sufficient that you implement only the
model scoring; other times you might build an application that
automatically updates reports, Excel spreadsheets, or PowerPoint
presentations. The last stage of the data science process is where
your soft skills will be most useful, and yes, they’re extremely
important.
Course Faculty
Verified by
HoD
LECTURE HANDOUTS L8
CSE II/III
Course Name with Code : CS3352 & Foundations of Data Science
Course Faculty : [Link]
Unit : I - Introduction Date of Lecture:
SQL
Detailed content of the Lecture:
Preparing Data
The second step in the data mining process is to consolidate and
clean the data that was identified in the Defining the Problem step.
Data can be scattered across a company and stored in different
formats, or may contain inconsistencies such as incorrect or missing
entries.
Data cleaning is not just about removing bad data or interpolating
missing values, but about finding hidden correlations in the data,
identifying sources of data that are the most accurate, and
determining which columns are the most appropriate for use in
analysis.
Exploration techniques include calculating the minimum and maximum
values, calculating mean and standard deviations, and looking at the
distribution of the data. For example, you might determine by reviewing
the maximum, minimum, and mean values that the data is not
representative of your customers or business processes, and that you
therefore must obtain more balanced data or review the assumptions that
are the basis for your expectations. Standard deviations and other
distribution values can provide useful information about the stability and
accuracy of the results.
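These exploration statistics can be computed with Python's `statistics` module. The order values below are made up; one unusually large value is included on purpose.

```python
import statistics

# Hypothetical customer order values.
order_values = [120, 95, 130, 110, 105, 980, 115, 100]

summary = {
    "min": min(order_values),
    "max": max(order_values),
    "mean": statistics.mean(order_values),
    "stdev": statistics.stdev(order_values),  # sample standard deviation
}
print(summary)
```

Here the mean lies well above most of the values and the standard deviation is large, which is the kind of signal that prompts a closer look at whether the data are representative.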
Building Models
The mining structure is linked to the source of data, but does not actually
contain any data until you process it. When you process the mining
structure, SQL Server Analysis Services generates aggregates and other
statistical information that can be used for analysis. This information can
be used by any mining model that is based on the structure
Exploring and Validating Models
Before you deploy a model into a production environment, you will want to
test how well the model performs. Also, when you build a model, you
typically create multiple models with different configurations and test all
models to see which yields the best results for your problem and your
data.
Deploying and Updating Models
After the mining models exist in a production environment, you can
perform many tasks, depending on your needs. The following are some of
the tasks you can perform:
Use the models to create predictions, which you can then use to
make business decisions.
Create content queries to retrieve statistics, rules, or formulas from the
model.
Embed data mining functionality directly into an application. You
can include Analysis Management Objects (AMO), which contains a
set of objects that your application can use to create, alter, process,
and delete mining structures and mining models.
Use Integration Services to create a package in which a mining
model is used to intelligently separate incoming data into multiple
tables.
Create a report that lets users directly query against an existing mining
model
Data warehousing
Data warehousing is the process of constructing and using a data
warehouse. A data warehouse is constructed by integrating data from
multiple heterogeneous sources that support analytical reporting,
structured and/or ad hoc queries, and decision making. Data warehousing
involves data cleaning, data integration, and data consolidation.
Course Faculty
Verified by HoD
LECTURE HANDOUTS L9
CSE II/III
Introduction :
Data consist of numbers, of course. But these numbers are fed into the
computer, not produced by it. These are numbers to be treated with considerable
respect, neither to be tampered with, nor subjected to a numerical process whose
character you do not completely understand. You are well advised to acquire a
reverence for data that is rather different from the “sporty” attitude that is
sometimes allowable, or even commendable, in other numerical tasks.
The analysis of data inevitably involves some trafficking with the field of
statistics, that gray area which is not quite a branch of mathematics and just as
surely not quite a branch of science. In the following sections, you will repeatedly
encounter the following paradigm: apply some formula to the data to compute “a
statistic”; compute where the value of that statistic falls in a probability distribution
that is computed on the basis of some “null hypothesis”; if it falls in a very unlikely
spot, way out on a tail of the distribution, conclude that the null hypothesis is false for
your data set.
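A minimal sketch of this paradigm, assuming a made-up coin-tossing experiment. The normal approximation and the 0.05 threshold are conventional choices introduced here as assumptions.

```python
from statistics import NormalDist

# Hypothetical experiment: 60 heads in 100 coin tosses.
n, heads = 100, 60

# Null hypothesis: the coin is fair (probability of heads is 0.5).
p0 = 0.5
expected = n * p0
std_dev = (n * p0 * (1 - p0)) ** 0.5

# The statistic: how many standard deviations the count is from expectation.
z = (heads - expected) / std_dev

# Where z falls in the null distribution (normal approximation, two-sided).
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# 0.05 is a conventional threshold, used here as an assumption.
reject_null = p_value < 0.05
print(z, round(p_value, 4), reject_null)
```

The three steps map directly onto the paradigm: compute a statistic, locate it in the distribution implied by the null hypothesis, and reject the null hypothesis if it lands far out on a tail.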
Course Faculty
Verified by HoD
CSE II/III
Introduction :
Today data is everywhere in every field. Whether you are a data scientist,
marketer, businessman, data analyst, researcher, or you are in any other profession,
you need to play or experiment with raw or structured data. This data is so important
for us that it becomes important to handle and store it properly, without any error.
While working on these data, it is important to know the types of data to process them
and get the right results.
SQL
Detailed content of the Lecture:
Nominal data.
Ordinal data.
Discrete data.
Continuous data.
Now business runs on data, and most companies use data for their insights to create
and launch campaigns, design strategies, launch products and services or try out
different things. According to a report, today, at least 2.5 quintillion bytes of data are
produced per day.
Types of Data
Qualitative or Categorical Data
Qualitative or Categorical Data is data that can’t be measured or counted in the form
of numbers. These types of data are sorted by category, not by number. That’s why it
is also known as Categorical Data. These data consist of audio, images, symbols, or
text. The gender of a person, i.e., male, female, or others, is qualitative data.
Qualitative data tells about the perception of people. This data helps market
researchers understand the customers’ tastes and then design their ideas and
strategies accordingly.
Discrete Data
The term discrete means distinct or separate. The discrete data contain the values
that fall under integers or whole numbers. The total number of students in a class is an
example of discrete data. These data can’t be broken into decimal or fraction values.
The discrete data are countable and have finite values; their subdivision is not
possible. These data are represented mainly by a bar graph, number line, or frequency
table.
Continuous Data
Continuous data are in the form of fractional numbers. Examples include the height
of a person, the length of an object, etc. Continuous data
represents information that can be divided into smaller levels. The continuous variable
can take any value within a range. The key difference between discrete and
continuous data is that discrete data contains integers or whole numbers, whereas
continuous data can take fractional values.
Examples of Continuous Data:
Height of a person
Speed of a vehicle
“Time-taken” to finish the work
Wi-Fi Frequency
Difference between Discrete and Continuous Data
Discrete Data | Continuous Data
Discrete data are countable and finite; they are whole numbers or integers | Continuous data are measurable; they are in the form of fractions or decimals
Discrete data are represented mainly by bar graphs | Continuous data are represented in the form of a histogram
The values cannot be divided into smaller pieces | The values can be divided into smaller pieces
Course Faculty
Verified by HoD
CSE II/III
Mathematics
Detailed content of the Lecture:
TYPES OF VARIABLES
A variable is a characteristic or property that can take on different values.
The weights can be described not only as quantitative data but also
as observations for a quantitative variable, since the various
weights take on different numerical values.
By the same token, the replies can be described as observations
for a qualitative variable, since the replies to the Facebook profile
question take on different values of either Yes or No.
Given this perspective, any single observation can be described as
a constant, since it takes on only one value.
Discrete variables can only assume specific values that you cannot
subdivide. Typically, you count discrete values, and the results are
integers.
Examples
Counts- such as the number of children in a family. (1, 2, 3, etc., but
never 1.5)
These variables cannot have fractional or decimal values. You can have
20 or 21 cats, but not 20.5
The number of heads in a sequence of coin tosses.
The result of rolling a die.
The number of patients in a hospital.
The population of a country.
While discrete variables have no decimal places, the average of these
values can be fractional. For example, families can have only a discrete
number of children: 1, 2, 3, etc. However, the average number of children
per family can be 2.2.
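A quick check of this point, assuming five made-up families: every count is a whole number, yet the average is fractional.

```python
import statistics

# Hypothetical number of children in five families (discrete counts).
children = [1, 2, 3, 2, 3]

average = statistics.mean(children)
print(average)
```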
Independent variables (IVs) are the ones that you include in the
model to explain or predict changes in the dependent variable.
Independent indicates that they stand alone and other variables in the
model do not influence them.
Independent variables are also known as predictors, factors,
treatment variables, explanatory variables, input variables, x-
variables, and right-hand variables—because they appear on the
right side of the equals sign in a regression equation.
It is a variable that stands alone and isn't changed by the other variables
you are trying to measure.
For example, someone's age might be an independent variable.
Other factors (such as
what they eat, how much they go to school, how much television
they watch) cannot change a person's age.
Dependent Variable
When a variable is believed to have been influenced by the independent
variable, it is called a dependent variable. In an experimental setting,
the dependent variable is measured, counted, or recorded by the
investigator.
The dependent variable (DV) is what you want to use the model to
explain or predict. The values of this variable depend on other
variables.
It’s also known as the response variable, outcome variable, and
left-hand variable. Graphs place dependent variables on the
vertical, or Y, axis.
A dependent variable is exactly what it sounds like. It is something that
depends on other factors.
For example, a blood sugar test result depends on what food you ate, when you
ate it, and so on.
Unlike the independent variable, the dependent variable isn’t manipulated by the
investigator. Instead, it represents an outcome: the data produced by the experiment.
Confounding variable
Course Faculty
Verified by HoD
LECTURE HANDOUTS L12
CSE II/III
Table structures
SQL
Database concepts
Detailed content of the Lecture:
Grouped Data
A frequency distribution organizes observations according to their frequency of
occurrence.
When observations are sorted into classes of more than one value, the result is
referred to as a frequency distribution for grouped data.
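A minimal sketch of how a grouped frequency distribution could be built in Python. The scores, the class width of 10, and the function name grouped_frequency are illustrative assumptions, not part of the handout.

```python
from collections import Counter

def grouped_frequency(scores, class_width, start):
    """Count how many integer scores fall in each class of equal width."""
    # Class number k covers the values start + k*width ... start + (k+1)*width - 1.
    counts = Counter((s - start) // class_width for s in scores)
    table = []
    for k in sorted(counts):
        lower = start + k * class_width
        upper = lower + class_width - 1
        table.append((lower, upper, counts[k]))
    return table

# Hypothetical scores, used only for illustration.
scores = [12, 15, 21, 23, 24, 30, 31, 38, 39, 39]
table = grouped_frequency(scores, class_width=10, start=10)
for lower, upper, freq in table:
    print(f"{lower}-{upper}: {freq}")
```

Each row of the result is one class (lower limit, upper limit) together with its frequency.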
LECTURE HANDOUTS
L13
CSE II/III
Introduction :
A graph is a visual representation of numerical data. Graphs provide a visual way to
summarize complex data and to show the relationship between different variables or
sets of data. Graphs are also an excellent way to demonstrate trends and
relationships within the data.
Prerequisite knowledge for Complete understanding and learning of Topic:
Data structures
DBMS
Detailed content of the Lecture:
Data can be described clearly and concisely with the aid of a well-constructed
frequency distribution. And data can often be described even more vividly by
converting frequency distributions into graphs.
Histograms
A bar-type graph for quantitative data. The common boundaries between
adjacent bars emphasize the continuity of the data, as with continuous variables.
A histogram is a display of statistical information that uses rectangles to show the
frequency of data items in successive numerical intervals of equal size.
Using visual representations to present data from Indicators for School Health
(SLIMS), surveys, or other evaluation activities makes the data easier to understand.
Bar graphs, pie charts, line graphs, and histograms are an excellent way to illustrate
program results.
Types of Graphs and Charts
A bar graph is composed of discrete bars that represent different categories of data.
The length or height of each bar is equal to the quantity within that category of data.
Bar graphs are best used to compare values across categories.
LECTURE HANDOUTS L14
CSE II/III
Introduction :
The mode reflects the value of the most frequently occurring score. The
median reflects the middle value when observations are ordered from least
to most. The range is the difference between the largest and smallest
scores.
Prerequisite knowledge for Complete understanding and learning of Topic:
Statistics
averages
Detailed content of the Lecture:
Describing Data with Averages
MODE
The mode reflects the value of the most frequently occurring score. In
other words, a mode is the value that has the highest frequency in a given
set of values: the value that appears the greatest number of times.
Example:
In the given set of data: 2, 4, 5, 5, 6, 7, the mode of the data set is 5 since it
has appeared in the set twice.
Types of Modes
Bimodal, Trimodal & Multimodal (More than one mode)
When there are two modes in a data set, the set is called bimodal.
For example, the modes of Set A = {2,2,2,3,4,4,5,5,5} are 2 and 5, because
both 2 and 5 are repeated three times in the given set.
When there are three modes in a data set, the set is called trimodal.
For example, the modes of Set A = {2,2,2,3,4,4,5,5,5,7,8,8,8} are 2, 5 and 8.
When there are four or more modes in a data set, the set is called
multimodal.
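The mode examples above can be checked with Python's standard library. This is a small sketch; the helper name describe_modes and its labels are assumptions made for illustration.

```python
from statistics import multimode

def describe_modes(values):
    """Return the list of modes and a label for how many modes there are."""
    modes = multimode(values)  # every value that shares the highest frequency
    labels = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return modes, labels.get(len(modes), "multimodal")

single = describe_modes([2, 4, 5, 5, 6, 7])
double = describe_modes([2, 2, 2, 3, 4, 4, 5, 5, 5])
triple = describe_modes([2, 2, 2, 3, 4, 4, 5, 5, 5, 7, 8, 8, 8])
print(single, double, triple)
```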
MEDIAN
The median reflects the middle value when observations are ordered from least
to most.
Example:
Find the median of the following:
9, 7, 2, 11, 18, 12, 6, 4
Solution: n = 8
When we put those numbers in order we have:
2, 4, 6, 7, 9, 11, 12, 18
Since n is even, the median is the average of the two middle terms:
Median = 1/2 [(n/2)th term + ((n/2) + 1)th term]
= 1/2 [(8/2)th term + ((8/2) + 1)th term]
= 1/2 [4th term + 5th term] (in our list the 4th term is 7 and the 5th term is 9)
= 1/2 [7 + 9]
= 1/2 (16)
= 8
The median value of this set of numbers is 8.
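The same procedure can be sketched in Python. The function below is an illustrative helper, not part of the handout; it handles both an even and an odd number of observations.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

result = median([9, 7, 2, 11, 18, 12, 6, 4])
print(result)
```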
MEAN
The mean is found by adding all scores and then dividing by the number of
scores.
Mean is the average of the given numbers and is calculated by dividing the
sum of given numbers by the total number of numbers.
Types of means
Sample mean
Population mean
Sample Mean
The sample mean is a measure of central tendency. It is the arithmetic
average computed from a sample, that is, from values drawn from the
population. It is evaluated as the sum of all the sample values divided by
the number of values in the sample.
Population Mean
The population mean is calculated as the sum of all values in the
population divided by the total number of values in the population.
L15
LECTURE HANDOUTS
CSE II/III
Introduction :
The range is the difference between the largest and smallest scores.
Statistics
averages
Detailed content of the Lecture:
RANGE
The range is the difference between the largest and smallest scores.
The range in statistics for a given data set is the difference between the
highest and lowest values. For example, if the given data set is {2,5,8,10,3},
then the range will be 10 – 2 = 8.
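Example: find the range of 28, 54, 35, 26, 23, 33, 38, 40.
Solution: Let us first arrange the values in ascending order:
23, 26, 28, 33, 35, 38, 40, 54. The range is 54 – 23 = 31.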
VARIANCE
Variance is a measure of how far a set of data values is spread out from
their mean (average) value.
Formula
$$\sigma^2 = \frac{\sum (x - \mu)^2}{n}, \qquad \text{Variance} = (\text{Standard deviation})^2$$
Example
Scores x = 5, 8, 6, 10, 12, 9, 11, 10, 12, 7 (n = 10)
Σx = 90
Mean μ = 90 / 10 = 9
Deviation from mean:
x − μ = −4, −1, −3, 1, 3, 0, 2, 1, 3, −2
(x − μ)² = 16, 1, 9, 1, 9, 0, 4, 1, 9, 4
Σ(x − μ)² = 16 + 1 + 9 + 1 + 9 + 0 + 4 + 1 + 9 + 4 = 54
σ² = Σ(x − μ)² / n = 54 / 10 = 5.4
The standard deviation is the square root of the mean of all squared
deviations from the mean, that is,
Standard deviation = √variance
Standard Deviation: A rough measure of the average (or standard)
amount by which scores deviate on either side of their mean.
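As a sketch, the variance and standard deviation can be computed directly from the definition. The ten scores used here are an assumption: they are the values implied by a mean of 9 and the deviations listed in the worked example (−4, −1, −3, 1, 3, 0, 2, 1, 3, −2).

```python
from math import sqrt

def population_variance(scores):
    """Mean of the squared deviations from the mean (divides by n)."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / n

# Assumed scores: mean 9 plus each listed deviation.
scores = [5, 8, 6, 10, 12, 9, 11, 10, 12, 7]
variance = population_variance(scores)
std_dev = sqrt(variance)  # standard deviation is the square root of the variance
print(variance, std_dev)
```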
Formula
Degrees of freedom df = n − 1
Example
Consider a data set that consists of five positive integers whose sum must be
a multiple of 6. Four of the values are selected freely, say 3, 8, 5, and 4.
The sum of these four values is 20. So we have to choose the fifth integer to
make the sum divisible by 6, for example 10 (since 20 + 10 = 30). Only
n − 1 = 4 of the values were free to vary.
LECTURE HANDOUTS L16
CSE II/III
Course Name with Code : CS3352 & Foundations of Data Science
Basic Mathematics
Statistics
Detailed content of the Lecture:
The interquartile range (IQR) is simply the range for the middle 50
percent of the scores. More specifically, the IQR equals the distance between the
third quartile (or 75th percentile) and the first quartile (or 25th percentile), that
is, after the highest quarter (or top 25 percent) and the lowest quarter (or
bottom 25 percent) have been trimmed from the original set of scores. Since
most distributions are spread more widely in their extremities than their middle,
the IQR tends to be less than half the size of the range.
Simply, The IQR describes the middle 50% of values when ordered from
lowest to highest. To find the interquartile range (IQR), first find the median (middle
value) of the lower and upper half of the data. These values are quartile 1 (Q1) and
quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
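The median-of-halves procedure described above can be sketched as follows. One assumption is made explicit in the code: when the number of values is odd, the overall median is left out of both halves (other quartile conventions exist and give slightly different answers).

```python
from statistics import median

def interquartile_range(values):
    """Return (Q1, Q3, IQR) using the median of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # lower half (middle value excluded when n is odd)
    upper = ordered[-half:]   # upper half
    q1, q3 = median(lower), median(upper)
    return q1, q3, q3 - q1

q1, q3, iqr = interquartile_range([9, 7, 2, 11, 18, 12, 6, 4])
print(q1, q3, iqr)
```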
THE NORMAL CURVE
The normal distribution is a continuous probability distribution that is
symmetrical on both sides of the mean, so the right side of the center is a
mirror image of the left side.
Properties of the Normal Curve
The normal curve is a theoretical curve defined for a continuous
variable, as described in Section 1.6, and noted for its symmetrical
bell-shaped form, as shown in the figure below.
Because the normal curve is symmetrical, its lower half is the mirror
image of its upper half.
The normal curve peaks above a point midway along the horizontal
spread and then tapers off gradually in either direction from the
peak (without actually touching the horizontal axis, since, in theory,
the tails of a normal curve extend infinitely far).
Properties of a normal distribution
The mean, mode and median are all equal.
The curve is symmetric at the center (i.e. around the mean, μ).
Exactly half of the values are to the left of center and exactly half the
values are to the right.
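These properties can be checked numerically with the standard library's NormalDist class. The mean of 100 and standard deviation of 15 are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

curve = NormalDist(mu=100, sigma=15)  # hypothetical mean and standard deviation

left_area = curve.cdf(curve.mean)           # proportion of values left of the mean
height_below = curve.pdf(curve.mean - 10)   # curve height 10 units below the mean
height_above = curve.pdf(curve.mean + 10)   # curve height 10 units above the mean
peak = curve.pdf(curve.mean)                # the curve is highest at the mean

print(left_area, height_below, height_above, peak)
```

Equal heights at equal distances on either side of the mean reflect the symmetry of the curve.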
LECTURE HANDOUTS L17
CSE II/III
Introduction :
The normal curve is a theoretical curve defined for a continuous variable, and
noted for its symmetrical bell-shaped form.
The normal curve peaks above a point midway along the horizontal spread and
then tapers off gradually in either direction from the peak.
Prerequisite knowledge for Complete understanding and learning of Topic:
Basic Mathematics
Statistics
Discrete Mathematics
Detailed content of the Lecture:
LECTURE HANDOUTS
L18
CSE II/III
Introduction :
Probability
Statistics
Mathematics
Detailed content of the Lecture:
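z SCORES
A z score is a unit-free, standardized score that indicates how many standard
deviations a score lies above or below the mean: z = (X − μ) / σ.
For the standard normal curve (the distribution of z scores):
Mean = 0
Standard deviation = 1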
FINDING SCORES
So far, we have concentrated on normal curve problems for which
Table A must be consulted to find the unknown proportion (of area)
associated with some known score or pair of known scores.
Now we will concentrate on the opposite type of normal curve problem,
for which Table A must be consulted to find the unknown score or scores
associated with some known proportion.
This type of problem requires that we reverse our use of Table A
by entering proportions in columns B, C, B′, or C′ and finding z scores listed in
columns A or A′.
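Both directions of the table lookup can be sketched in Python, with NormalDist standing in for Table A. The mean of 100, the standard deviation of 15, and the helper names are assumptions for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def z_score(x, mean, sd):
    """Convert a raw score to a z score."""
    return (x - mean) / sd

def score_from_proportion(proportion_below, mean, sd):
    """Reverse lookup: find the raw score with the given proportion below it."""
    z = standard_normal.inv_cdf(proportion_below)
    return mean + z * sd

# Finding a proportion from a known score.
z = z_score(130, mean=100, sd=15)
proportion_below = standard_normal.cdf(z)

# Finding a score from a known proportion.
score = score_from_proportion(0.5, mean=100, sd=15)
print(z, proportion_below, score)
```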
FINDING PROPORTIONS
LECTURE HANDOUTS L19
CSE II/III
Correlation
Correlation refers to a process for establishing the relationship between
two variables. One way to get a general idea of whether or not two variables
are related is to plot them on a “scatter plot”. While there are
many measures of association for variables that are measured at the
ordinal or higher level of measurement, correlation is the most commonly
used approach.
Probability
Statistics
Mathematics
Detailed content of the Lecture:
Types of Correlation
Positive Correlation – when the values of the two variables move in
the same direction so that an increase/decrease in the value of one
variable is followed by an increase/decrease in the value of the other
variable.
Negative Correlation – when the values of the two variables move
in the opposite direction so that an increase/decrease in the value of
one variable is followed by decrease/increase in the value of the other
variable.
No Correlation – when there is no apparent relationship between the
values of the two variables.
SCATTERPLOTS
A scatter plot is a graph containing a cluster of dots that represents all
pairs of scores. In other words, scatter plots are graphs that present the
relationship between two variables in a data set. A scatter plot represents data
points on a two-dimensional plane or Cartesian system.
The first step is to note the tilt or slope, if any, of a dot cluster.
A dot cluster that has a slope from the lower left to the upper right, as in
panel A of the figure below, reflects a positive relationship.
A dot cluster that has a slope from the upper left to the lower right, as in
panel B of the figure below, reflects a negative relationship.
A dot cluster that lacks any apparent slope, as in panel C of the figure below,
reflects little or no relationship.
A dot cluster that equals (rather than merely approximates) a straight line
reflects a perfect relationship between two variables.
Curvilinear Relationship
The previous discussion assumes that a dot cluster approximates a straight
line and, therefore, reflects a linear relationship. But this is not always
the case. Sometimes a dot cluster approximates a bent or curved line, as in
below figure, and therefore reflects a curvilinear relationship.
LECTURE HANDOUTS L20
CSE II/III
Introduction :
The correlation coefficient, r, is a summary measure that describes the
extent of the statistical relationship between two interval or ratio level
variables.
Prerequisite knowledge for Complete understanding and learning of Topic:
Statistics
DBMS
Detailed content of the Lecture:
Properties of r
The correlation coefficient is scaled so that it is always between -1 and +1.
When r is close to 0 this means that there is little relationship
between the variables and the farther away from 0 r is, in either the
positive or negative direction, the greater the relationship between
the two variables.
The sign of r indicates the type of linear relationship, whether positive or
negative.
The numerical value of r, without regard to sign, indicates the strength of
the linear relationship.
A number with a plus sign (or no sign) indicates a positive
relationship, and a number with a minus sign indicates a negative
relationship
LECTURE HANDOUTS
L21
CSE II/III
Introduction :
Statistics
DBMS
Detailed content of the Lecture:
The sum of the products term in the numerator, SPxy, is defined in the formula
below.
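A sketch of the computation, assuming the usual definitions r = SPxy / √(SSx · SSy), with SPxy = Σ(x − x̄)(y − ȳ), SSx = Σ(x − x̄)² and SSy = Σ(y − ȳ)². The function name and the data are illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient from the sum of products and the sums of squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / sqrt(ss_x * ss_y)

r_positive = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])   # perfect positive
r_negative = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])   # perfect negative
r_moderate = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])    # positive, not perfect
print(r_positive, r_negative, r_moderate)
```

The sign of each result gives the direction of the relationship and its absolute value gives the strength, always within −1 and +1.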
CSE II/III
Distribution concepts
Probability
Statistics
Detailed content of the Lecture:
A Regression Line
A regression line is a line that best describes the behavior of a set of data.
In other words, it’s a line that best fits the trend of the given data.
The purpose of the line is to describe the interrelation of a dependent
variable (Y variable) with one or many independent variables (X variables).
By using the equation obtained from the regression line, an analyst can
forecast future behaviours of the dependent variable by inputting different
values for the independent ones.
Types of regression
The two basic types of regression are
Simple linear regression
Simple linear regression uses one independent variable to explain or predict the
outcome of the dependent variable Y.
Multiple linear regression
Multiple linear regression uses two or more independent variables to explain or
predict the outcome of the dependent variable Y.
Linear regression has many practical uses. Most applications fall into one of the
following two broad categories:
Predictive Error
Prediction error refers to the difference between the predicted values
made by some model and the actual values.
LECTURE HANDOUTS L23
CSE II/III
Introduction :
The placement of the regression line minimizes not the total predictive
error but the total squared predictive error, that is, the total for all squared
predictive errors. When located in this fashion, the regression line is often
referred to as the least squares regression line. The Least Squares Regression
Line is the line that minimizes the sum of the residuals squared. The residual is
the vertical distance between the observed point and the predicted point, and it
is calculated by subtracting ŷ from y.
Prerequisite knowledge for Complete understanding and learning of Topic:
Probability
Discrete Mathematics
Statistics
Detailed content of the Lecture:
The Least Squares Regression Line is the line that minimizes the sum
of the residuals squared. The residual is the vertical distance between the
observed point and the predicted point, and it is calculated by subtracting
ŷ from y.
Formula
$$y' = bx + a \qquad (b = \text{slope},\ a = y\text{-intercept})$$
$$b = \frac{N \sum xy - \sum x \sum y}{N \sum x^2 - \left(\sum x\right)^2}$$
$$a = \frac{\sum y - b \sum x}{N}$$
In statistics, ordinary least squares (OLS) is a type of linear least
squares method for choosing the unknown parameters in a linear regression model
(with fixed level-one effects of a linear function of a set of explanatory variables) by
the principle of least squares: minimizing the sum of the squares of the differences
between the observed dependent variable (values of the variable being observed) in
the input dataset and the output of the (linear) function of the independent variable.
Geometrically, this is seen as the sum of the squared distances, parallel to the
axis of the dependent variable, between each data point in the set and the
corresponding point on the regression surface—the smaller the differences, the
better the model fits the data. The resulting estimator can be expressed by a simple
formula, especially in the case of a simple linear regression, in which there is a
single regressor on the right side of the regression equation.
Example
x    y
2    4
3    5
5    7
7    10
9    15
Fitting the least squares line to these points gives y = 1.518x + 0.305:
x    y    y = 1.518x + 0.305    error
2    4    3.34                  −0.66
3    5    4.86                  −0.14
5    7    7.89                   0.89
7    10   10.93                  0.93
9    15   13.97                 −1.03
To predict a y value we can assume any value for x. Assume x = 8.
Then y = 1.518 × 8 + 0.305 = 12.45
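The slope and intercept for this example can be reproduced with a short sketch that applies the least squares formulas directly. The function name least_squares_line is an illustrative assumption.

```python
def least_squares_line(xs, ys):
    """Return (slope b, intercept a) of the least squares regression line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return b, a

xs = [2, 3, 5, 7, 9]
ys = [4, 5, 7, 10, 15]
b, a = least_squares_line(xs, ys)
prediction = b * 8 + a  # predicted y for x = 8
print(round(b, 3), round(a, 3), round(prediction, 2))
```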
CSE II/III
Introduction :
Probability
Discrete Mathematics
Statistics
Detailed content of the Lecture:
The standard error of estimate, symbolized as s y|x, is an estimate of predictive
error that complies with the general format for any sample standard deviation, that
is, the square root of a sum of squares term divided by its degrees of freedom:
$$s_{y|x} = \sqrt{\frac{\sum (y - y')^2}{n - 2}}$$
Example
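Calculate the standard error of estimate for the given X and Y values:
X = 1, 2, 3, 4, 5 and Y = 2, 4, 5, 4, 5
Solution
Create five columns labeled x, y, y’, y – y’, (y – y’)², with N = 5.
Note: to find b we need Σxy and Σx², so add xy and x² columns to the table.
Σx = 15, Σy = 20, Σxy = 66, Σx² = 55
b = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)
= (5 × 66 − 15 × 20) / (5 × 55 − 15²)
= 30/50 = 0.6
a = (Σy − b Σx) / N
= (20 – (0.6 × 15)) / 5
= (20 – 9) / 5
= 11/5 = 2.2
So y’ = 0.6x + 2.2:
x    y    y’     y – y’    (y – y’)²
1    2    2.8    −0.8      0.64
2    4    3.4     0.6      0.36
3    5    4.0     1.0      1.00
4    4    4.6    −0.6      0.36
5    5    5.2    −0.2      0.04
Σ(y – y’)² = 2.4
s y|x = √(Σ(y – y’)² / (N – 2)) = √(2.4/3) = √0.8 = 0.894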
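The same calculation as a sketch in Python, assuming the standard error of estimate divides the sum of squared residuals by n − 2 degrees of freedom. The function name is illustrative.

```python
from math import sqrt

def standard_error_of_estimate(xs, ys):
    """Fit the least squares line, then return sqrt(sum of squared residuals / (n - 2))."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    ss_residual = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
    return sqrt(ss_residual / (n - 2))

s_yx = standard_error_of_estimate([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(s_yx, 3))
```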
LECTURE HANDOUTS L25
CSE II/III
Course Name with Code : CS3352 & Foundations of Data Science
Statistics
Detailed content of the Lecture:
Example:
A researcher decides to study students’ performance at a school over a
period of time. He observes that as lectures moved online, the performance
of students started to decline. The dependent variable “decrease in
performance” is explained by several independent variables such as “lack of
attention”, “more internet addiction”, “neglecting studies”, and more.
Multiple regression equation
$$y = b_1x_1 + b_2x_2 + \dots + b_nx_n + a$$
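As a sketch of how the coefficients of such an equation could be estimated, the example below fits two predictors by least squares with NumPy. The data are hypothetical and generated exactly from y = 2·x1 + 3·x2 + 1, so the fit should recover those coefficients.

```python
import numpy as np

# Hypothetical predictors (columns x1 and x2), for illustration only.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 1

# Append a column of ones so the intercept a is estimated with b1 and b2.
A = np.column_stack([X, np.ones(len(X))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
b1, b2, a = coeffs
print(b1, b2, a)
```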
Example
A military commander has two units return, one with 20% casualties and
another with 50% casualties. He praises the first and berates the second.
The next time, the two units return with the opposite results. From this
experience, he “learns” that praise weakens performance and berating
increases performance.
L26
LECTURE HANDOUTS
CSE II/III
Course Name with Code : CS3352 & Foundations of Data Science
Probability
Discrete Mathematics
Statistics
Detailed content of the Lecture:
CSE II/III
Statistics
Detailed content of the Lecture:
CSE II/III
Python
Mathematics
Detailed content of the Lecture:
Attributes of arrays
Determining the size, shape, memory consumption, and data types of
arrays
Indexing of arrays
Getting and setting the value of individual array elements
Slicing of arrays
Getting and setting smaller sub arrays within a larger array
Reshaping of arrays
Changing the shape of a given array
Example
import numpy as np
np.random.seed(0)  # seed for reproducibility
x3 = np.random.randint(10, size=(3, 4, 5))  # example three-dimensional array
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print("dtype:", x3.dtype)
print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")
Array Indexing:
Accessing Single Elements
x = np.arange(10)
x[:5]  # first five elements
array([0, 1, 2, 3, 4])
x[5:]  # elements from index 5 onward
array([5, 6, 7, 8, 9])
Reshaping of Arrays
The most flexible way of doing this is with the reshape() method. For
example, if you want to put the numbers 1 through 9 in a 3×3 grid, you can
do the following
grid = np.arange(1, 10).reshape((3, 3))
print(grid)
[[1 2 3]
 [4 5 6]
 [7 8 9]]
Concatenation of arrays
Concatenation, or joining of two arrays in NumPy, is primarily accomplished
through the routines np.concatenate, np.vstack, and np.hstack.
np.concatenate takes a tuple or list of arrays as its first argument.
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
array([1, 2, 3, 3, 2, 1])
z = [99, 99, 99]
print(np.concatenate([x, y, z]))
[ 1  2  3  3  2  1 99 99 99]
np.concatenate can also be used for two-dimensional arrays
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
np.concatenate([grid, grid])  # concatenate along the first axis
array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])
Concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1)
array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])
np.vstack([x, grid])  # vertically stack the arrays
array([[1, 2, 3],
       [9, 8, 7],
       [6, 5, 4]])
y = np.array([[99],
              [99]])
np.hstack([grid, y])  # horizontally stack the arrays
array([[ 9,  8,  7, 99],
       [ 6,  5,  4, 99]])
Splitting of arrays
The opposite of concatenation is splitting, which is implemented by the functions
np.split, np.hsplit, and np.vsplit. For each of these, we can pass a list of
indices giving the split points.
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)
[1 2 3] [99 99] [3 2 1]
Notice that N split points lead to N + 1 subarrays. The related functions np.hsplit
and np.vsplit are similar.
grid = np.arange(16).reshape((4, 4))
grid
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
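A short sketch of how np.vsplit and np.hsplit could be applied to this 4×4 grid. The variable names upper, lower, left and right are illustrative.

```python
import numpy as np

grid = np.arange(16).reshape((4, 4))

# Split along rows: one split point at row 2 gives two sub arrays.
upper, lower = np.vsplit(grid, [2])

# Split along columns: one split point at column 2 gives two sub arrays.
left, right = np.hsplit(grid, [2])

print(upper)
print(lower)
print(left)
print(right)
```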
CSE II/III
Course Name with Code : CS3352 & Foundations of Data Science
Introducing UFuncs
NumPy provides a convenient interface into just this kind of statically typed,
compiled routine. This is known as a vectorized operation.
Python
Mathematics
Data base
Detailed content of the Lecture:
Array arithmetic
NumPy’s ufuncs make use of Python’s native arithmetic operators. The
standard addition, subtraction, multiplication, and division can all be used.
x = np.arange(4)
print("x =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
Absolute value
Just as NumPy understands Python’s built-in arithmetic operators, it also
understands Python’s built-in absolute value function.
np.absolute()
np.abs()
x = np.array([-2, -1, 0, 1, 2])
abs(x)
array([2, 1, 0, 1, 2])
The corresponding NumPy ufunc is np.absolute, which is also available under the
alias np.abs
np.absolute(x)
array([2, 1, 0, 1, 2])
Trigonometric functions
NumPy provides a large number of useful ufuncs, and some of the most useful
for the data scientist are the trigonometric functions.
np.sin()
np.cos()
np.tan()
Inverse trigonometric functions
np.arcsin()
np.arccos()
np.arctan()
Python has built-in min and max functions, used to find the minimum
value and maximum value of any given array.
For min, max, sum, and several other NumPy aggregates, a shorter syntax is
to use methods of the array object itself.
Multidimensional aggregates
One common type of aggregation operation is an aggregate along a row or
column.
By default, each NumPy aggregation function returns the aggregate over the
entire array. For example, np.sum() calculates the sum of all elements of
the array.
Example
M = np.random.random((3, 4))
print(M)
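To make the row and column behaviour concrete, here is a small sketch using a fixed array instead of random values (an assumption made so the results are predictable). The axis argument names the dimension that is collapsed.

```python
import numpy as np

M = np.arange(12).reshape((3, 4))  # fixed values 0..11 for predictable results

total = M.sum()            # aggregate over the entire array
col_min = M.min(axis=0)    # collapse the rows: one minimum per column
row_max = M.max(axis=1)    # collapse the columns: one maximum per row

print(total, col_min, row_max)
```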
Exponentiation is an operation that takes two elements: a base and an exponent.
For example, x^y is an exponential operation where x is the base and y is the
exponent.
When y is a positive integer, exponentiation corresponds to repeated
multiplication of the base.
Modular exponentiation is a type of exponentiation in which a modulo division
operation is performed after the exponentiation.
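A brief sketch of both operations. np.power and np.mod are NumPy ufuncs, and Python's built-in three-argument pow performs modular exponentiation on integers; the sample values are illustrative.

```python
import numpy as np

x = np.arange(1, 5)               # [1, 2, 3, 4]

squares = np.power(x, 2)          # each element raised to the power 2 (same as x ** 2)
powers_of_two = np.power(2, x)    # 2 raised to each element

mod_result = pow(3, 4, 5)               # (3 ** 4) % 5 with the built-in pow
mod_array = np.mod(np.power(x, 3), 5)   # element-wise cube, then modulo 5

print(squares, powers_of_two, mod_result, mod_array)
```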
Rules of Broadcasting
Broadcasting in NumPy follows a strict set of rules to determine the interaction
between the two arrays.
• Rule 1: If the two arrays differ in their number of dimensions, the shape
of the one with fewer dimensions is padded with ones on its
leading (left) side.
• Rule 2: If the shape of the two arrays does not match in any dimension, the
array with shape equal to 1 in that dimension is stretched to match
the other shape.
• Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error
is raised
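The three rules can be seen in a small sketch. The array shapes are chosen only for illustration.

```python
import numpy as np

a = np.arange(3)                 # shape (3,)

# Rules 1 and 2: (2, 3) + (3,) -> a is treated as (1, 3), then stretched to (2, 3).
M = np.ones((2, 3))
result = M + a

# Rule 2 on both arrays: (3, 1) + (3,) -> (3, 1) + (1, 3) -> (3, 3).
col = np.arange(3).reshape((3, 1))
table = col + a

# Rule 3: (3, 2) + (3,) -> (3, 2) vs (1, 3); sizes 2 and 3 disagree, so an error is raised.
try:
    np.ones((3, 2)) + a
    raised = False
except ValueError:
    raised = True

print(result.shape, table, raised)
```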
LECTURE HANDOUTS L30
CSE II/III
Course Name with Code : CS3352 & Foundations of Data Science
Mathematics
Probability
Python
Detailed content of the Lecture:
x = np.array([1, 2, 3, 4, 5])
x < 3  # less than
array([ True,  True, False, False, False], dtype=bool)
x == 3  # equal
array([False, False,  True, False, False], dtype=bool)
Just as in the case of arithmetic ufuncs, these will work on arrays of any size and shape.
Here is a two- dimensional example
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x
array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])
Boolean operators
Operator    Equivalent ufunc
&           np.bitwise_and
|           np.bitwise_or
^           np.bitwise_xor
~           np.bitwise_not
x < 5
array([[False,  True,  True,  True],
       [False, False,  True, False],
       [ True,  True, False, False]], dtype=bool)
Masking operation
To select these values from the array, we can simply index on this Boolean array; this is known as
a masking operation.
x[x < 5]
array([0, 3, 3, 3, 2, 4])
What is returned is a one-dimensional array filled with all the values that meet this condition;
in other words, all the values in positions at which the mask array is True.
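A short sketch of masking on the same 3×4 array, including a compound condition built with the & operator. The array is typed in directly rather than generated randomly, so the values do not depend on the random number generator.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

below_five = x[x < 5]                      # values where the mask is True
count_below_five = np.count_nonzero(x < 5) # how many entries satisfy the condition
between = x[(x > 2) & (x < 7)]             # combine two masks element-wise

print(below_five, count_below_five, between)
```

The parentheses around each comparison are needed because & binds more tightly than < and >.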
LECTURE HANDOUTS
L31
CSE II/III
Course Name with Code : CS3352 & Foundations of Data Science
Data structure
Python
Database
Detailed content of the Lecture:
import numpy as np
rand = np.random.RandomState(42)
x = rand.randint(100, size=10)
print(x)
[51 92 14 71 60 20 82 86 74 74]
Array of indices
We can pass a single list or array of indices to select several elements at once.
ind = [3, 7, 4]
x[ind]
array([71, 86, 60])
Multi dimensional
Fancy indexing also works in multiple dimensions. Consider the following array.
X = np.arange(12).reshape((3, 4))
X
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Standard indexing
Like with standard indexing, the first index refers to the row,
and the second to the column.
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
X[row, col]
array([ 2,  5, 11])
Combined Indexing
For even more powerful operations, fancy indexing can be combined with the
other indexing schemes we’ve seen.
Example array
print(X)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
Combine fancy and simple indices
X[2, [2, 0, 1]]
array([10, 8, 9])
Combine fancy indexing with masking
mask = np.array([1, 0, 1, 0], dtype=bool)
X[row[:, np.newaxis], mask]
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])
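The indexing schemes above can be compared side by side in one sketch. The variable names pairs, grid and mixed are illustrative.

```python
import numpy as np

X = np.arange(12).reshape((3, 4))
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

pairs = X[row, col]                  # element-wise pairs: X[0, 2], X[1, 1], X[2, 3]
grid = X[row[:, np.newaxis], col]    # (3, 1) and (3,) broadcast to a (3, 3) selection
mixed = X[1:, [2, 0, 1]]             # slice on rows combined with fancy column indices

print(pairs)
print(grid)
print(mixed)
```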
SORTING ARRAYS
Python has built-in sort and sorted functions to work with lists, but we won’t
discuss them here because NumPy’s np.sort function turns out to be much
more efficient and useful for our purposes. By default np.sort uses an
O[N log N] quicksort algorithm, though mergesort and heapsort are also available.
For most applications, the default quicksort is more than sufficient.
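Partial sorts
np.partition(x, 3) rearranges an array so that its three smallest values come
first, for example:
array([2, 1, 3, 4, 6, 5, 7])
Note that the first three values in the resulting array are the three smallest in
the array, and the remaining array positions contain the remaining values.
Within the two partitions, the elements have arbitrary order.
A multidimensional array can be partitioned along an axis in the same way; here
the first two slots of each row hold that row's two smallest values:
array([[3, 4, 6, 7, 6, 9],
       [2, 3, 4, 7, 6, 7],
       [1, 2, 4, 5, 7, 7],
       [0, 1, 4, 5, 9, 5]])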
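A sketch of a partial sort with np.partition. The input array is an assumption chosen for illustration. Because the order inside each partition is arbitrary, the example checks the partition property rather than one exact output.

```python
import numpy as np

x = np.array([7, 2, 3, 1, 6, 5, 4])  # illustrative input
k = 3

part = np.partition(x, k)            # the k smallest values end up in the first k slots
smallest = np.sort(part[:k])         # sort them only to display them in a fixed order
boundary_ok = part[:k].max() <= part[k:].min()

print(part, smallest, boundary_ok)
```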
Structured Arrays
This section demonstrates the use of NumPy’s structured arrays and record
arrays, which provide efficient storage for compound, heterogeneous data.
NumPy data types
Character   Description              Example
'b'         Byte                     np.dtype('b')
'i'         Signed integer           np.dtype('i4') == np.int32
'u'         Unsigned integer         np.dtype('u1') == np.uint8
'f'         Floating point           np.dtype('f8') == np.float64
'c'         Complex floating point   np.dtype('c16') == np.complex128
'S', 'a'    String                   np.dtype('S5')
'U'         Unicode string           np.dtype('U') == np.str_
'V'         Raw data (void)          np.dtype('V') == np.void
A compound data type can be specified as a list of (name, type) pairs:
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
U10 - Unicode string of maximum length 10
i4 - 4-byte (i.e., 32 bit) integer
f8 - 8-byte (i.e., 64 bit) float
Now we can fill the array with our lists of values
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
[('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0) ('Doug', 19, 61.5)]
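A self-contained sketch of the structured array shown above. The lists name, age and weight are taken from the printed output; the dictionary form of the dtype is one of several equivalent ways to specify it.

```python
import numpy as np

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

# Empty container with one compound record per person.
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})

data['name'] = name
data['age'] = age
data['weight'] = weight

# Fields can be used in Boolean masks, for example names of people under 30.
under_30 = data[data['age'] < 30]['name']
print(data)
print(under_30)
```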
LECTURE HANDOUTS L32
CSE II/III
Python
Mathematics
Database
Detailed content of the Lecture:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
Finding values
The values are simply a familiar NumPy array
data.values
array([ 0.25, 0.5 , 0.75, 1. ])
Finding index
The index is an array-like object of type pd.Index
data.index
Series as generalized NumPy array
While the NumPy array has an implicitly defined integer index used to access the
values, the Pandas Series has an explicitly defined index associated with the
values.
This explicit index definition gives the Series object additional capabilities.
For example, the index need not be an integer, but can consist of values of
any desired type.
For example, if we wish, we can use strings as an index.
Strings as an index
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
        DS   FDS
sai     90   91
ram     85   95
kasim   92   89
tamil   89   90
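A sketch of how a table like this could be built as a Pandas DataFrame. The variable name marks and the derived TOTAL column are illustrative assumptions.

```python
import pandas as pd

marks = pd.DataFrame(
    {'DS': [90, 85, 92, 89], 'FDS': [91, 95, 89, 90]},
    index=['sai', 'ram', 'kasim', 'tamil'],
)

# A new column can be computed from existing ones; the row index is preserved.
marks['TOTAL'] = marks['DS'] + marks['FDS']

ds_column = marks['DS']   # dictionary-style access returns one column as a Series
print(marks)
```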
CSE II/III
Database
Python
Mathematics
Detailed content of the Lecture:
Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection
of keys to a collection of values.
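data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
data['b']
0.5
Series objects can be modified with a dictionary-like syntax
data['e'] = 1.25
data
a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64
Slicing by explicit index
data['a':'c']
a    0.25
b    0.50
c    0.75
dtype: float64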
Masking
data[(data > 0.3) & (data < 0.8)]
b 0.50
c 0.75
loc - The loc attribute allows indexing and slicing that always references the
explicit index.
iloc - The iloc attribute allows indexing and slicing that always references the
implicit Python-style index.
ix - ix was a hybrid of the two, and for Series objects was equivalent to standard
[ ]-based indexing. It was deprecated in pandas 0.20 and removed in pandas 1.0, so
use loc or iloc instead.
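The difference between the two indexers is easiest to see on a Series whose explicit index is itself made of integers. The sketch below uses a small made-up Series (an assumption, not data from the handout) with labels 1, 3 and 5.

```python
import pandas as pd

# Hypothetical Series: integer labels that differ from the positions.
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = data.loc[1]            # explicit index: the element labelled 1
by_position = data.iloc[1]        # implicit index: the second element

label_slice = data.loc[1:3]       # label slice, both end labels included
position_slice = data.iloc[1:3]   # position slice, end position excluded
```

Here data.loc[1] returns 'a' while data.iloc[1] returns 'b', which is why the handout recommends stating explicitly which kind of index is meant.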
Data Selection in DataFrame
DataFrame as a dictionary
DataFrame as two-dimensional array
Additional indexing conventions
DataFrame as a dictionary
Modifying values
Indexing conventions may also be used to set or modify values; this
is done in the standard way that you might be accustomed to from
working with NumPy.
result.iloc[1, 1] = 70
        DS  FDS  TOTAL
sai     90   91    181
ram     85   70    180
kasim   92   89    181
tamil   89   90    179
result['sai':'kasim']
        DS  FDS  TOTAL
sai     90   91    181
ram     85   70    180
kasim   92   89    181
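The marks table can be rebuilt as follows to try the modification and the row slice. This is a sketch: the DataFrame construction and the way TOTAL is computed are assumptions, since the handout shows only the resulting tables.

```python
import pandas as pd

# Marks as shown in the handout; construction details are assumed.
result = pd.DataFrame({'DS': [90, 85, 92, 89],
                       'FDS': [91, 95, 89, 90]},
                      index=['sai', 'ram', 'kasim', 'tamil'])
result['TOTAL'] = result['DS'] + result['FDS']

# Set row 1 (ram), column 1 (FDS) by position.
result.iloc[1, 1] = 70

# Slicing with labels selects rows and includes both end labels.
subset = result['sai':'kasim']
print(result)
print(subset)
```

Note that TOTAL for ram stays at 180 after the change, because the column was computed before the FDS mark was modified.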
Index Preservation
Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas
Series and DataFrame objects. We can use all the arithmetic and special universal
functions of NumPy on Pandas objects. In the output, the index is preserved
(maintained), as shown below.
For Series
0 1
1 2
2 3
3 4
dtype: int64
For DataFrame
df = pd.DataFrame(np.random.randint(0, 10, (3, 4)),
                  columns=['a', 'b', 'c', 'd'])
df
   a  b  c  d
0  1  4  1  4
1  8  4  0  4
2  7  7  7  2
0 8103.083928
1 54.598150
2 403.428793
3 20.085537
dtype: float64
Ufuncs for Data Frame
[Link](df)
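Index preservation can be checked with a short sketch. The Series values below are an assumption, chosen so that their exponentials match the numbers printed above; the DataFrame reuses the table shown in the handout.

```python
import numpy as np
import pandas as pd

# Assumed values: exp(9), exp(4), exp(6), exp(3) match the printed output.
ser = pd.Series([9, 4, 6, 3])
exp_ser = np.exp(ser)          # result is a Series with the same index

# DataFrame values copied from the handout's table.
df = pd.DataFrame([[1, 4, 1, 4],
                   [8, 4, 0, 4],
                   [7, 7, 7, 2]], columns=['a', 'b', 'c', 'd'])
exp_df = np.exp(df)            # result keeps both row and column labels

print(exp_ser)
print(exp_df)
```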
Index Alignment
For binary operations on two Series or DataFrame objects, Pandas aligns the indices
while performing the operation. This is very convenient when you are working with
incomplete data.
1 3.0
2 3.0
3 9.0
4 7.0
5 6.0
dtype: float64
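A small sketch of index alignment, using two made-up Series (an assumption, not the data behind the output above) whose indices only partly overlap.

```python
import pandas as pd

# Hypothetical Series with partially overlapping indices.
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Labels present in only one operand produce NaN in the result.
total = A + B

# The method form lets us substitute a fill value for missing entries.
filled = A.add(B, fill_value=0)

print(total)
print(filled)
```

The result index is the union of the two input indices; A.add(B, fill_value=0) is the usual way to avoid the NaN entries when a missing value should count as zero.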
A - A[0]
array([[ 0, 0, 0, 0],
[-1, -2, 2, 4],
[ 3, -7, 1, 4]])
Handling Missing Data
A number of schemes have been developed to indicate the presence of
missing data in a table or DataFrame. Generally, they revolve around one of
two strategies: using a mask that globally indicates missing values, or
choosing a sentinel value that indicates a missing entry.
In the masking approach, the mask might be an entirely separate Boolean
array, or it may involve appropriation of one bit in the data representation to
locally indicate the null status of a value.
dtype('float64')
You should be aware that NaN is a bit like a data virus: it infects any other
object it touches. Regardless of the operation, the result of arithmetic with
NaN will be another NaN.
     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN
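The behaviour of NaN, and two common ways of dealing with it in Pandas, can be sketched as follows. The array vals is a made-up example; the DataFrame reproduces the table above.

```python
import numpy as np
import pandas as pd

# NaN forces a floating-point dtype and propagates through arithmetic.
vals = np.array([1, np.nan, 3, 4])
plain_sum = vals.sum()        # NaN, because one entry is NaN
safe_sum = np.nansum(vals)    # ignores the NaN entry

# The table from the handout: column 3 is entirely NaN.
df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])
df[3] = np.nan

# Drop columns that contain only missing values, or fill the gaps.
dropped_cols = df.dropna(axis='columns', how='all')
filled = df.fillna(0)
```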
Pandas MultiIndex
Pandas provides a better way. Our tuple-based indexing is essentially a
rudimentary multi-index, and the Pandas MultiIndex type gives us the type of
operations we wish to have. We can create a multi-index from the tuples as
follows
index = [('California', 2000), ('California', 2010),
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
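A sketch of the MultiIndex construction described above, using the tuples and populations from the handout. The level names 'state' and 'year' are assumptions added for readability.

```python
import pandas as pd

index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

# Build the hierarchical index explicitly from the tuples.
# The level names are hypothetical labels, not from the handout.
multi = pd.MultiIndex.from_tuples(index, names=['state', 'year'])
pop = pd.Series(populations, index=multi)

# Select every state for one year using the second index level.
pop_2010 = pop.loc[:, 2010]

# Convert the two-level Series into a conventional DataFrame.
wide = pop.unstack()
print(pop)
```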
Universal functions
All the ufuncs and other functionality work with hierarchical indices.
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()
                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568
Methods of Multi Index Creation
The most straightforward way to construct a multiply indexed Series or DataFrame is
to pass a list of two or more index arrays to the constructor.
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df
        data1     data2
a 1  0.554233  0.356072
  2  0.925244  0.219474
b 1  0.441759  0.610054
  2  0.171495  0.886688
Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will
automatically recognize this and use a MultiIndex by default.
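Both creation methods can be compared in a short sketch. The dictionary below is illustrative and reuses two population figures that appear earlier in the handout.

```python
import numpy as np
import pandas as pd

# Method 1: a list of index arrays passed to the constructor.
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])

# Method 2: a dictionary whose keys are tuples.
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956}
pop = pd.Series(data)

print(df.index)
print(pop.index)
```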
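x             y             pd.concat([x, y])
    A   B         A   B         A   B
0  A0  B0     0  A2  B2     0  A0  B0
1  A1  B1     1  A3  B3     1  A1  B1
                            0  A2  B2
                            1  A3  B3
df1.append(df2)
(DataFrame.append was removed in pandas 2.0; pd.concat([df1, df2]) gives the same result.)
df1           df2           df1.append(df2)
    A   B         A   B         A   B
1  A1  B1     3  A3  B3     1  A1  B1
2  A2  B2     4  A4  B4     2  A2  B2
                            3  A3  B3
                            4  A4  B4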
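The concatenation tables above can be reproduced with the sketch below. The helper make_df is a hypothetical convenience function, not part of Pandas; it only generates labelled cells such as 'A0' and 'B1'.

```python
import pandas as pd

def make_df(cols, ind):
    """Hypothetical helper: build a DataFrame whose cells are like 'A0'."""
    return pd.DataFrame({c: [f"{c}{i}" for i in ind] for c in cols},
                        index=ind)

x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index                 # force duplicate index labels 0 and 1

# Plain concatenation keeps the original (duplicated) labels.
stacked = pd.concat([x, y])

# ignore_index=True replaces them with a fresh 0..n-1 index.
renumbered = pd.concat([x, y], ignore_index=True)

print(stacked)
print(renumbered)
```

pd.concat is also the replacement for the row-wise append of one DataFrame to another in current Pandas versions.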
Categories of Joins
One-to-one joins
Many-to-one joins
Many-to-many joins
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
Many-to-one joins
Many-to-one joins are joins in which one of the two key columns contains
duplicate entries. For the many-to-one case, the resulting DataFrame will
preserve those duplicate entries as appropriate.
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
pd.merge(df3, df4)
Many-to-many joins
Many-to-many joins are a bit confusing conceptually, but are nevertheless
well defined. If the key column in both the left and right array contains
duplicates, then the result is a many-to-many merge. This will be perhaps
most clear with a concrete example.
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering',
                              'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
pd.merge(df1, df5)
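The three categories of joins can be run together in one sketch. df2, df4 and df5 are taken from the handout; df1 is not shown there, so the employee-to-group table below is an assumption, and df3 is assumed to be the one-to-one merge of df1 and df2.

```python
import pandas as pd

# Assumed table: df1 does not appear in the handout.
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering',
                              'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

# One-to-one: 'employee' is unique in both tables.
df3 = pd.merge(df1, df2)

# Many-to-one: 'group' repeats in df3 but is unique in df4.
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
many_to_one = pd.merge(df3, df4)

# Many-to-many: 'group' repeats in both df1 and df5.
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering',
                              'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
many_to_many = pd.merge(df1, df5)

print(many_to_one)
print(many_to_many)
```

In each call pd.merge finds the shared column name on its own and uses it as the key; the many-to-many result has one row for every matching pair of rows.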
52 13.000000 3.428571
Short assignment
linestyle='-' # solid
linestyle='--' # dashed
linestyle='-.' # dashdot
linestyle=':' # dotted
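The shorthand codes can be tried on a single figure with the sketch below. The straight lines and their offsets are arbitrary illustration data.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()

# One line per shorthand code: solid, dashed, dashdot, dotted.
for offset, style in enumerate(['-', '--', '-.', ':']):
    ax.plot(x, x + offset, linestyle=style, label=style)

ax.legend()
styles = [line.get_linestyle() for line in ax.get_lines()]
```

When running this as a script rather than in a notebook, finish with plt.show() to display the figure.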
Another commonly used plot type is the simple scatter plot, a close cousin of
the line plot. Instead of points being joined by line segments, here the points
are represented individually with a dot, circle, or other shape.
Syntax
plt.plot(x, y, 'type of symbol', color=...);
Example
plt.plot(x, y, 'o', color='black');
The third argument in the function call is a character that represents the
type of symbol used for the plotting. Just as you can specify options such
as '-' and '--' to control the line style, the marker style has its own set of
short string codes.
Example
The symbols that can be used as markers include ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd'].
A shorthand that combines line style, marker symbol and color is also allowed.
plt.plot(x, y, '-ok');
Scatter plot with edge color, face color, size, and width of marker.
(Scatter plot with line)
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 20)
y = np.sin(x)
plt.plot(x, y, '-o',
         color='gray',
         markersize=15,
         linewidth=4,
         markerfacecolor='yellow',
         markeredgecolor='red',
         markeredgewidth=4)
plt.ylim(-1.5, 1.5);
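The shorthand format string can be inspected directly, as in this sketch. The sine data is an assumption used only to have something to plot.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 20)
y = np.sin(x)   # assumed sample data

fig, ax = plt.subplots()
# '-ok' means: solid line ('-'), circle marker ('o'), black ('k').
(line,) = ax.plot(x, y, '-ok')
ax.set_ylim(-1.5, 1.5)

marker = line.get_marker()
style = line.get_linestyle()
color = line.get_color()
```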
Basic Errorbars
A basic errorbar can be created with a single Matplotlib function call.
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')  # named 'seaborn-v0_8-whitegrid' in Matplotlib 3.6 and later
import numpy as np
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k');
Here the fmt is a format code controlling the appearance of lines and
points, and has the same syntax as the shorthand used in plt.plot().
In addition to these basic options, the errorbar function has many
options to fine tune the outputs. Using these additional options you
can easily customize the aesthetics of your errorbar plot.
plt.errorbar(x, y, yerr=dy, fmt='o', color='black',
             ecolor='lightgray', elinewidth=3, capsize=0);
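A self-contained version of the customized errorbar call is sketched below. The noisy sine data and the fixed random seed are assumptions made so that the example is reproducible.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)      # fixed seed, an arbitrary choice
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * rng.standard_normal(50)   # assumed noisy data

fig, ax = plt.subplots()
# Black circular points with light gray, thicker error bars and no caps.
container = ax.errorbar(x, y, yerr=dy, fmt='o', color='black',
                        ecolor='lightgray', elinewidth=3, capsize=0)
data_line = container.lines[0]      # the Line2D holding the points
```

Making the error bars lighter than the points, as here, keeps crowded plots readable.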
Continuous Errors
In some situations it is desirable to show errorbars on continuous
quantities. Though Matplotlib does not have a built-in convenience
routine for this type of application, it’s relatively easy to combine
primitives like plt.plot and plt.fill_between for a useful result.
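One way to combine the two primitives is sketched below. The curve and the error model err are made up for illustration; in practice they would come from a fitted model.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
y = np.sin(x)              # assumed central estimate
err = 0.2 + 0.1 * x        # hypothetical error that grows with x

fig, ax = plt.subplots()
ax.plot(x, y, '-', color='gray')
# Shade the region between the lower and upper error bounds.
band = ax.fill_between(x, y - err, y + err, color='gray', alpha=0.2)
```

fill_between takes the x values, then the lower and the upper y bounds, and fills the area between them.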
Biometric authentication is the process of verifying your identity using
measurements or other unique characteristics of your body, then logging you in to a
service, an app, a device and so on. Biometric identification verifies that you are
you based on your body measurements. Biometric identification systems can be grouped
by the main physical characteristic that lends itself to biometric identification:
fingerprint identification, hand geometry, retina scan, iris scan, face recognition,
signature, and voice. Password authentication is also called weak authentication.
This is among the most conventional schemes, in which a user has a user ID and a
password. The user ID acts as a claim and the password as evidence supporting the claim.
Prerequisite knowledge for Complete understanding and learning of Topic:
Linear Algebra
Number Theory
Combinatorics
Programming language such as Java and Python
Detailed content of the Lecture:
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
plt.contour(X, Y, Z, colors='black');
Notice that by default when a single color is used, negative values are
represented by dashed lines, and positive values by solid lines.
Alternatively, you can color-code the lines by specifying a colormap with the
cmap argument.
We’ll also specify that we want more lines to be drawn—20 equally spaced
intervals within the data range.
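The two contour variants described above can be sketched as follows, using the function f and the grids from the handout. Drawing them on two separate figures is a choice made here for clarity.

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)     # 2-D grids built from the 1-D arrays
Z = f(X, Y)

# Single color: negative levels are drawn dashed, positive ones solid.
fig1, ax1 = plt.subplots()
lines = ax1.contour(X, Y, Z, colors='black')

# Color-coded lines with a colormap and about 20 intervals.
fig2, ax2 = plt.subplots()
colored = ax2.contour(X, Y, Z, 20, cmap='RdGy')
```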
One potential issue with this plot is that it is a bit “splotchy.” That is, the
color steps are discrete rather than continuous, which is not always what
is desired.
You could remedy this by setting the number of contours to a very high
number, but this results in a rather inefficient plot: Matplotlib must render
a new polygon for each step in the level.
A better way to handle this is to use the plt.imshow() function, which
interprets a two-dimensional grid of data as an image.
Example Program
import numpy as np
import matplotlib.pyplot as plt
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
# extent must match the range of x and y (0 to 5) so the axes are labelled correctly
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower', cmap='RdGy')
plt.colorbar()
Linear Algebra
Number Theory
Combinatorics
Programming language such as java and python
Detailed content of the Lecture:
Histograms
A histogram is a simple plot for representing a large data set. It is a graph
showing a frequency distribution, that is, the number of observations within
each given interval.
Parameters
plt.hist() is used to plot a histogram. The hist() function takes an array of
numbers, passed in as an argument, and creates a histogram from it.
Other parameter
**kwargs - Patch properties. The ** prefix lets us pass a variable number of
keyword arguments to a Python function.
Example
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
data = np.random.randn(1000)
plt.hist(data);
The hist() function has many options to tune both the calculation and the
display; here’s an example of a more customized histogram.
plt.hist(data, bins=30, alpha=0.5, histtype='stepfilled',
         color='steelblue', edgecolor='none');
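The customized histogram can be run as the sketch below, which also shows how to obtain the bin counts without drawing anything. The seeded random generator is an assumption made for reproducibility.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)          # fixed seed, an arbitrary choice
data = rng.standard_normal(1000)

fig, ax = plt.subplots()
# hist returns the counts, the bin edges and the drawn patches.
counts, bin_edges, patches = ax.hist(data, bins=30, alpha=0.5,
                                     histtype='stepfilled',
                                     color='steelblue', edgecolor='none')

# np.histogram computes the same counts without plotting.
counts_only, edges_only = np.histogram(data, bins=30)
```

With 30 bins there are 31 bin edges, and the counts always add up to the number of observations.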
Legends
Plot legends give meaning to a visualization, assigning labels to the various plot
elements. We previously saw how to create a simple legend; here we’ll take a
look at customizing the placement and aesthetics of the legend in Matplotlib.
ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')
ax.legend();
Customizing Plot Legends
Location and frame - We can specify the location of the legend and turn off
its frame using the loc and frameon parameters.
ax.legend(loc='upper left', frameon=False)
fig
Number of columns - We can use the ncol parameter to specify the number of
columns in the legend.
ax.legend(frameon=False, loc='lower center', ncol=2)
fig
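The legend options above can be combined in one runnable sketch. It assumes the sine and cosine curves used in the simple legend example.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')

# Two-column legend at the bottom centre, drawn without a frame.
legend = ax.legend(loc='lower center', frameon=False, ncol=2)
labels = [text.get_text() for text in legend.get_texts()]
```

Only plot elements that were given a label appear in the legend.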
Multiple legends
Via the standard legend interface, it is only possible to create a single
legend for the entire plot. If you try to create a second legend using
plt.legend() or ax.legend(), it will simply override the first one. We can work
around this by creating a new legend artist from scratch, and then using the
lower-level ax.add_artist() method to manually add the second artist to the plot.
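The workaround for a second legend can be sketched as follows. The four phase-shifted sine curves and the labels 'line A' to 'line D' are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.legend import Legend

x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()

# Four hypothetical curves, one per line style.
styles = ['-', '--', '-.', ':']
lines = [ax.plot(x, np.sin(x - i * np.pi / 2), styles[i], color='black')[0]
         for i in range(4)]

# First legend through the standard interface.
ax.legend(lines[:2], ['line A', 'line B'], loc='upper right', frameon=False)

# Second legend built as a separate artist and added by hand.
second = Legend(ax, lines[2:], ['line C', 'line D'],
                loc='lower right', frameon=False)
ax.add_artist(second)

first_labels = [t.get_text() for t in ax.get_legend().get_texts()]
```

Because the second legend is added as a plain artist, the standard legend of the axes is left untouched.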
Example
import matplotlib.pyplot as plt
plt.style.use('classic')
import numpy as np
x = np.linspace(0, 10, 1000)
ax.legend(loc='lower center', frameon=True,
          shadow=True, borderpad=1, fancybox=True)
fig
Subplots
Matplotlib has the concept of subplots: groups of smaller axes that can exist
together within a single figure.
These subplots might be insets, grids of plots, or other more complicated
layouts.
We’ll explore four routines for creating subplots in Matplotlib.
plt.axes: Subplots by Hand
plt.subplot: Simple Grids of Subplots
plt.subplots: The Whole Grid in One Go
plt.GridSpec: More Complicated Arrangements
Here we’ll create a 2×3 grid of subplots, where all axes in the same row
share their y-axis scale, and all axes in the same column share their x-axis
scale.
fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')
Note that by specifying sharex and sharey, we’ve automatically removed inner
labels on the grid to make the plot cleaner.
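A short sketch of the shared-axis grid: plt.subplots returns the axes in a NumPy array, so each panel can be reached with [row, column] indexing. Writing the grid position into each panel is only for illustration.

```python
import matplotlib.pyplot as plt

# 2 rows x 3 columns; x scale shared per column, y scale per row.
fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')

# ax is a 2-D array of Axes objects indexed by [row, column].
for i in range(2):
    for j in range(3):
        ax[i, j].text(0.5, 0.5, str((i, j)), fontsize=18, ha='center')
```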
Example
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-whitegrid')
import numpy as np
import pandas as pd
fig, ax = plt.subplots(facecolor='lightgray')
ax.axis([0, 10, 0, 10])
# transform=ax.transData is the default, but we'll specify it anyway
ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)
ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)
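The difference between the two coordinate systems shows up when the axis limits change, as in the sketch below. The final change of limits is an addition made here for illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(facecolor='lightgray')
ax.axis([0, 10, 0, 10])

# Data coordinates: the position follows the x and y axis values.
data_text = ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)

# Axes coordinates: fractions of the axes box, (0, 0) is bottom-left.
axes_text = ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)

# Changing the limits moves the data-anchored text on screen,
# while the axes-anchored text stays where it is.
ax.set_xlim(0, 2)
```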
LECTURE HANDOUTS L43
CSE II/III
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_wireframe(X, Y, Z, color='black')
ax.set_title('wireframe');
plt.show()
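The wireframe program above uses X, Y and Z without defining them. The sketch below supplies one possible definition; the surface function f and the grid range are assumptions chosen only to give the wireframe something to draw.

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x, y):
    # Hypothetical surface: a radial sine wave.
    return np.sin(np.sqrt(x ** 2 + y ** 2))

x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

fig = plt.figure()
ax = fig.add_subplot(projection='3d')     # three-dimensional axes
ax.plot_wireframe(X, Y, Z, color='black')
ax.set_title('wireframe')
```

X, Y and Z must be two-dimensional arrays of the same shape, which is what np.meshgrid provides.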
LECTURE HANDOUTS L44
CSE II/III
Map Projections:
The Basemap package implements several dozen such projections, all referenced
by a short format code. Here we’ll briefly demonstrate some of the more common
ones.
Cylindrical projections
Pseudo-cylindrical projections
Perspective projections
Conic projections
Cylindrical projection
The simplest of map projections are cylindrical projections, in which lines
of constant latitude and longitude are mapped to horizontal and vertical
lines, respectively.
This type of mapping represents equatorial regions quite well, but
results in extreme distortions near the poles.
The spacing of latitude lines varies between different cylindrical
projections, leading to different conservation properties, and different
distortion near the poles.
Other cylindrical projections are the Mercator (projection='merc')
and the cylindrical equal-area (projection='cea') projections.
The additional arguments to Basemap for this view specify the latitude
(lat) and longitude (lon) of the lower-left corner (llcrnr) and upper-right
corner (urcrnr) for the desired map, in units of degrees.
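How the spacing of latitude lines differs between cylindrical projections can be seen from the standard formulas for the vertical map coordinate on a unit sphere. The sketch below does not use Basemap; the function names are made up for illustration.

```python
import math

def equidistant_y(lat_deg):
    """Equidistant cylindrical: y is proportional to latitude."""
    return math.radians(lat_deg)

def mercator_y(lat_deg):
    """Mercator: y = ln(tan(pi/4 + lat/2)), growing rapidly near the poles."""
    phi = math.radians(lat_deg)
    return math.log(math.tan(math.pi / 4 + phi / 2))

def equal_area_y(lat_deg):
    """Cylindrical equal-area: y = sin(lat), compressed near the poles."""
    return math.sin(math.radians(lat_deg))

for lat in (0, 40, 80):
    print(lat, round(equidistant_y(lat), 3),
          round(mercator_y(lat), 3), round(equal_area_y(lat), 3))
```

Doubling the latitude from 40 to 80 degrees doubles y in the equidistant projection, more than triples it in Mercator, and raises it by only about half in the equal-area projection, which is the trade-off between shape and area described above.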
import numpy as np
Pseudo-cylindrical projections
Pseudo-cylindrical projections relax the requirement that meridians
(lines of constant longitude) remain vertical; this can give better
properties near the poles of the projection.
The Mollweide projection (projection='moll') is one common example of
this, in which all meridians are elliptical arcs.
It is constructed so as to preserve area across the map: though there are
distortions near the poles, the area of small patches reflects the true area.
Other pseudo-cylindrical projections are the sinusoidal (projection='sinu')
and Robinson (projection='robin') projections.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='moll', resolution=None, lat_0=0, lon_0=0)
draw_map(m)
Linear Algebra
Number Theory
Combinatorics
Programming language such as java and python
Detailed content of the Lecture:
We can see the joint distribution and the marginal distributions together using
sns.jointplot.
Pair plots
When you generalize joint plots to datasets of larger dimensions, you end up with
pair plots. This is very useful for exploring correlations between multidimensional
data, when you’d like to plot all pairs of values against each other.
We’ll demo this with the Iris dataset, which lists measurements of petals and
sepals of three iris species:
import seaborn as sns
iris = sns.load_dataset("iris")
# the size argument was renamed to height in seaborn 0.9
sns.pairplot(iris, hue='species', height=2.5);
Factor plots
Factor plots can be useful for this kind of visualization as well. This allows you
to view the distribution of a parameter within bins defined by any other
parameter.
Joint distributions
Similar to the pair plot we saw earlier, we can use sns.jointplot to show the
joint distribution between different datasets, along with the associated marginal
distributions.
Department of Computer Science and Engineering
Question Bank – Academic Year (2024-25)
Course code & Course Name: CS3352 & Foundations of Data science
Name of the Faculty : [Link]
Year/Sem/Sec : II/III/A
UNIT-I
PART-A (2 Marks)
1. Define Data Science.
2. Discuss in brief about the tools for data science model building.
UNIT-II
PART-A (2 marks)
1. Define data. What are the types of data?
2. What is qualitative data? Give example.
6. What is variance?
7. What is z score?
1. Elaborate the different ways to describe or represent data using tables with
suitable example.
2. Explain the various ways in which data can be represented or described using
graphs.
3. Explain the different types of frequency distribution with suitable example and
diagrams.
4. Using the computation formula for the sum of squares, calculate the population
5. Compute the mean, median and mode for the following data sets:
45, 55, 60, 60, 63, 63, 63, 63, 65, 65, 70
26, 26, 28, 27, 26, 27, 26, 26
1. Using computation formula for the sum of squares calculate the population
standard
1, 3,7,2,0,4,3,7
10, 8, 5,0,1,7,9,2,1
2. Consider the test scores approximating a normal curve with a mean of 500 and
a standard deviation of 100. Sketch a normal curve and shade in the target
area described by the following:
139,139,139,145,1475,150,145,136,150,152,144,138,138,150,149,133,134,152,155,151
UNIT-III
PART-A (2 marks)
1. What is correlation?
3. What is a scatterplot?
7. What is an outlier?
8. Define regression.
correlation coefficients.
4. For the standard error of the estimate of the mean weight of high school
football
Weights in Pounds :150 ,203, 176, 190 ,168 ,193, 189, 178, 197, 172
1. Calculate and analyse the correlation coefficient between the number of study
hours
Study hours: 2 4 6 8 10
Sleeping hours: 10 9 8 7 6
Subject : 1 ,2 ,3 ,4 ,5 ,6
Age: 43,21,25,42,57,59
UNIT-IV
PART-A (2 marks)
1. What is NumPy? List its uses.
1. List the prime numbers between 0 and 100 by using a Boolean array.
UNIT-V
PART-A (2 marks)
1. Write the significance of data visualization.
6. What is a histogram?
1. Write a Python program to plot a line chart by assuming your own data and explain.
2. Demonstrate the usage of histogram for data exploration and explain its
attributes.
3. Write a Python program to visualize a data set using scatter plots and explain.
5. Explain in detail about the functions of the mpl toolkit for geographic data
visualization.