
CS3352 Foundations of Data Science Syllabus

The document outlines the course file for CS3352, Foundations of Data Science, for the academic year 2024-2025, taught by Ms. P. Dheevambiga. It includes the syllabus, course objectives, outcomes, lesson plans, and a checklist of required materials. The course covers fundamental data science concepts, data wrangling using Python, and data visualization techniques.

Uploaded by

dheevambiga92
Copyright
© All Rights Reserved

COURSE FILE

Academic Year: 2024 - 2025

Course Code & Name : CS3352 & Foundations of Data Science

Year/Semester/Section : II/III/A

Name of the Faculty : [Link]

Designation : Assistant Professor

Department of the Faculty : Computer Science and Engineering


Department of Computer Science and Engineering
Syllabus - Academic Year (2024-25)

COURSE FILE – CHECK LIST

CSE 2024-25

Course Code & Course Name : CS3352 & Foundations of Data Science
Name of the Faculty : [Link]
Year/Sem/Sec : II/III/A
S. No. Content Available (Yes/No)
1. Syllabus Copy

2. Lesson Plan

3. Preamble

4. Subject Timetable
5. Minutes of the Course Committee Meeting (2 Meetings/Semester)
6. Course Material (Lecture Handouts)

7. Content Beyond the Syllabus

8. Question Bank

9. CIA I – Question Paper, Answer Key & Sample Papers (3 Nos)

10. CIA II – Question Paper, Answer Key & Sample Papers (3 Nos)

11. CIA III – Question Paper, Answer Key & Sample Papers (3 Nos)
12. Model Exam- Question Paper, Answer Key & Sample Papers (3 Nos)
13. MKC Material

14. Useful Websites / E-Content Details


15. End Semester Results and Analysis / Suggestions for Improvement in next Semester (if needed)
16. Previous End Semester Examination Question Papers

17. List CO, PO, PEO/PSO Mapping with Attainments

18. Any other Content

Faculty Signature :

Verified by HoD : Date of Verification :


Department of Computer Science and Engineering
Syllabus - Academic Year (2023-24)

Course Code & Course Name : CS3352 & Foundations of Data Science

Name of the Faculty : [Link]

Year/Sem/Sec : II/III/A

L T P C : 3 0 0 3

Course Objectives

 To understand the data science fundamentals and process.
 To learn to describe the data for the data science process.
 To learn to describe the relationship between data.
 To utilize the Python libraries for data wrangling.
 To present and interpret data using visualization libraries in Python.

Course Outcomes
 Define the data science process.
 Understand different types of data description for the data science process.
 Gain knowledge of relationships between data.
 Use the Python libraries for data wrangling.
 Apply visualization libraries in Python to interpret and explore data.

UNIT I INTRODUCTION 9
Data Science: Benefits and uses – Facets of data – Data Science Process: Overview – Defining research
goals – Retrieving data – Data preparation – Exploratory data analysis – Build the model – Presenting
findings and building applications – Data Mining – Data Warehousing – Basic statistical descriptions
of data

UNIT II DESCRIBING DATA 9
Types of Data – Types of Variables – Describing Data with Tables and Graphs – Describing Data with
Averages – Describing Variability – Normal Distributions and Standard (z) Scores

UNIT III DESCRIBING RELATIONSHIPS 9
Correlation – Scatter plots – Correlation coefficient for quantitative data – Computational formula for
correlation coefficient – Regression – Regression line – Least squares regression line – Standard error of
estimate – Interpretation of r² – Multiple regression equations – Regression towards the mean

UNIT IV PYTHON LIBRARIES FOR DATA WRANGLING 9
Basics of NumPy arrays – Aggregations – Computations on arrays – Comparisons, masks, Boolean logic –
Fancy indexing – Structured arrays – Data manipulation with Pandas – Data indexing and selection –
Operating on data – Missing data – Hierarchical indexing – Combining datasets – Aggregation and
grouping – Pivot tables

UNIT V DATA VISUALIZATION 9
Importing Matplotlib – Line plots – Scatter plots – Visualizing errors – Density and contour plots –
Histograms – Legends – Colors – Subplots – Text and annotation – Customization – Three-dimensional
plotting – Geographic data with Basemap – Visualization with Seaborn

Total : 45

Text Books:

S. No. | Author(s) | Title of the Book | Publisher | Year of Publication
1. | David Cielen, Arno D. B. Meysman | Introducing Data Science | Manning Publications | 2016
2. | Robert S. Witte and John S. Witte | Statistics | Wiley Publications | 2017
3. | Jake VanderPlas | Python Data Science Handbook | O’Reilly | 2016

Reference Books:

S. No. | Author(s) | Title of the Book | Publisher | Year of Publication
1. | Allen B. Downey | Think Stats: Exploratory Data Analysis in Python | Green Tea Press | 2014

Course Faculty HoD Principal


LESSON PLAN
CSE 2024-25
Course Code & Name : CS3352 & Foundations of Data Science
Name of the Faculty : [Link]
Year / Semester/Section : II/III/A
S. No. | Topics Covered | Proposed Lecture (Date, Period) | Actual Date of Lecture (Date, Period) | Mode of Delivery

UNIT I: INTRODUCTION (Target Hours: 9)
1 | Data Science: Benefits and uses, facets of data | | | A
2 | Data Science Process: Overview | | | A
3 | Defining research goals | | | A
4 | Retrieving data | | | A
5 | Data preparation | | | A
6 | Exploratory data analysis | | | A
7 | Build the model – Presenting findings and building applications | | | A
8 | Data Mining – Data Warehousing | | | D
9 | Basic statistical descriptions of data | | | A

UNIT II: DESCRIBING DATA (Target Hours: 9)
10 | Types of Data | | | A
11 | Types of Variables | | | A
12 | Describing Data with Tables | | | A
13 | Describing Data with Graphs | | | A
14 | Describing Data with Averages | | | A
15 | Describing Variability | | | A
16 | Interquartile range | | | D
17 | Normal Distributions | | | A
18 | Standard (z) Scores | | | A

UNIT III: DESCRIBING RELATIONSHIPS (Target Hours: 9)
19 | Correlation – Scatter plots | | | A
20 | Correlation coefficient for quantitative data | | | A
21 | Computational formula for correlation coefficient | | | A
22 | Regression – Regression line | | | A
23 | Least squares regression line | | | A
24 | Standard error of estimate | | | A
25 | Interpretation of r² | | | A
26 | Multiple regression equations | | | A
27 | Regression towards the mean | | | A

UNIT IV: PYTHON LIBRARIES FOR DATA WRANGLING (Target Hours: 9)
28 | Basics of NumPy arrays | | | A
29 | Aggregations – Computations on arrays | | | A
30 | Comparisons, masks, Boolean logic | | | A
31 | Fancy indexing – Structured arrays | | | A
32 | Data manipulation with Pandas | | | A
33 | Data indexing and selection | | | A
34 | Operating on data – Missing data | | | A
35 | Hierarchical indexing – Combining datasets | | | A
36 | Aggregation and grouping – Pivot tables | | | A

UNIT V: DATA VISUALIZATION (Target Hours: 9)
37 | Importing Matplotlib – Line plots | | | A
38 | Scatter plots – Visualizing errors | | | A
39 | Density and contour plots | | | A
40 | Histograms – Legends | | | A
41 | Colors – Subplots | | | A
42 | Text and annotation – Customization | | | A
43 | Three-dimensional plotting | | | A
44 | Geographic data with Basemap | | | A
45 | Visualization with Seaborn | | | A

Content Beyond the Syllabus

Data science tools D

Mode of Delivery:

A: Chalk & Talk    B: Audio Visual Aids
C: Flipped Class Room Activity    D: PPT, NPTEL Videos, etc.
E: Group Discussion    F: Technical Quiz
G: Group Activity    H: Brain Storming
I: Role Play    J: MCQ
K: Technical Debate

Course Faculty (with Date) | Signature of HoD (Beginning of the Semester, with Date) | Signature of HoD (End of the Semester, with Date) | Principal (with Date)

PREAMBLE ABOUT THE SUBJECT

CSE 2024-25

Course Code & Course Name : CS3352 & Foundations of Data Science
Name of the Faculty : [Link]
Year/Sem/Sec : II / III / A Date :

Objectives of Study:
 To understand the data science fundamentals and process.
 To learn to describe the data for the data science process.
 To learn to describe the relationship between data.
 To utilize the Python libraries for data wrangling.
 To present and interpret data using visualization libraries in Python.

Prerequisite knowledge for complete learning of subject:
Requires in-depth knowledge of data science and Python libraries.

Benefit / Application of this Subject:
 Commercial applications
 Human resource management
 Financial applications
 Fraud detection
 Customer analysis

Topics Mapping for Future Semester Subjects:
VIII : Project Work

Important Books / Journals for Learning:
 David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning Publications, 2016. (Unit I)
 Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017. (Units II and III)
 Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016. (Units IV and V)
 Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press, 2014.
Course Faculty HoD

Department of Computer Science and Engineering


Class Timetable
ODD Semester - Academic Year (2024-25)

Year/Sem/Sec : II/III/A

Period | Time
1 | 09.10-10.05
2 | 10.05-11.00
Morning Break | 11.00-11.15
3 | 11.15-12.10
4 | 12.10-01.00
Lunch Break | 01.00-01.45
5 | 01.45-02.40
6 | 02.40-03.35
7 | 03.35-04.30

Day | Subjects (in timetable order)
Monday | DM DS OOPS FDS DS LAB
Tuesday | FDS DM DPCO DS FDS LAB
Wednesday | OOPS DS FDS DM OOPS LAB
Thursday | NM NM NM
Friday | DS OOPS DPCO DM DPCO DPCO LAB
Saturday | DM DM FDS OOPS DPCO COUNSELLING

Course Code | Course Name | Abbreviation | Faculty Name
MA3354 | Discrete Mathematics | DM | [Link]
CS3351 | Digital principles and computer organization | DPCO | [Link] kumar & [Link]
CS3352 | Foundations of data science | FDS | [Link]
CS3301 | Data structures | DS | [Link]
CS3391 | Object oriented programming | OOPS | [Link]
CS3311 | Data structure laboratory | DS LAB | [Link]
CS3381 | Object oriented programming laboratory | OOPS LAB | [Link]
CS3361 | Data science laboratory | FDS LAB | [Link]

Class Advisor : [Link]
Exam Coordinator : [Link]
Placement Coordinator : [Link]

Timetable In-charge HoD Principal


Department of Computer Science and Engineering
Minutes of Course Committee Meeting - I : Academic Year (2024-25)

Course Code & Course Name : CS3352 & Foundations of Data Science

Course Committee Members Date :

S. No. | Course Faculty Name | Branch | Year/Sem | Section | Faculty Signature
1. | [Link] | CSE | II/III | A |
2. | [Link] | CSE | II/III | A |

Points Discussed:
 Planned to give homework problems on difficult topics.
 Discussed the status of portion completion.
 Students’ handwritten notes are to be checked and corrected periodically.
 Asked to provide a question bank.

Course Coordinator Subject Expert HoD


Department of Computer Science and Engineering
Minutes of Course Committee Meeting - II : Academic Year (2023-24)

Course Code & Course Name : CS3352 & Foundations of Data Science

Course Committee Members Date :

S. No. | Course Faculty Name | Branch | Year/Sem | Section | Faculty Signature
1. | [Link] | CSE | II/III | A |
2. | [Link] | CSE | II/III | A |

Points Discussed:
 Make the students concentrate more on previous years’ Anna University questions.
 Ensure that the students understand the subject.
 Discussed students’ feedback and performance in CIA I.
 Identify the weak students and give them home tests and assessments.

Course Coordinator Subject Expert HoD


LECTURE HANDOUTS L1

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : I - Introduction Date of Lecture:

Topic of Lecture: Data Science: Benefits and uses, Facets of Data

Data Science :
Data Science along with artificial intelligence (AI) and its various components
such as statistical learning (SL), machine learning (ML) and deep learning
algorithms (DL) are recognized as main drivers of organizational value
creation. According to Dr Jim Gray, Data Science is the fourth paradigm which
drives innovative solutions to organizational problems.
This course is suitable for students/practitioners interested in improving their
knowledge in the fundamental concepts of Data Science. The course will also
prepare the learner for a career in the field of Data Analytics.
Prerequisite knowledge for Complete understanding and learning of Topic:
 Python
 Discrete mathematics

 Statistics
Detailed content of the Lecture:
Introduction:
Data
In computing, data is information that has been translated into a form that is
efficient for movement or processing.
Data Science
Data science is an evolutionary extension of statistics capable of dealing with
the massive amounts of data produced today. It adds methods from computer
science to the repertoire of statistics.
Benefits and uses of data science
Data science and big data are used almost everywhere in both commercial and noncommercial
settings.
 Commercial companies in almost every industry use data science and big data
to gain insights into their customers, processes, staff, competition, and
products.
 Many companies use data science to offer customers a better user experience,
as well as to cross-sell, up-sell, and personalize their offerings.
 Governmental organizations are also aware of data’s value. Many
governmental organizations not only rely on internal data scientists to discover
valuable information, but also share their data with the public.
 Nongovernmental organizations (NGOs) use it to raise money and defend their
causes.
 Universities use data science in their research but also to enhance the study
experience of their students. The rise of massive open online courses (MOOC)
produces a lot of data, which allows universities to study how this type of
learning can complement traditional classes.
Data Science Process

Overview of the data science process

The typical data science process consists of six steps through which you’ll iterate.
Defining research goals
A project starts by understanding the what, the why, and the how
of your project. The outcome should be a clear research goal, a good
understanding of the context, well-defined deliverables, and a plan of
action with a timetable. This information is then best placed in a project
charter.

Spend time understanding the goals and context of your research


 An essential outcome is the research goal that states the purpose of
your assignment in a clear and focused manner.
 Understanding the business goals and context is critical for project
success.
 Continue asking questions and devising examples until you grasp
the exact business expectations.

Create a project charter


A project charter requires teamwork, and your input covers at least the
following:
 A clear research goal
 The project mission and context
 How you’re going to perform your analysis
 What resources you expect to use
 Proof that it’s an achievable project, or proof of concepts
 Deliverables and a measure of success
 A timeline
Retrieving data
 The next step in data science is to retrieve the required data.
Sometimes you need to go into the field and design a data
collection process yourself, but most of the time you won’t be
involved in this step.
 Many companies will have already collected and stored the data for
you, and what they don’t have can often be bought from third
parties.
 More and more organizations are making even high-quality data
freely available for public and commercial use.
 Data can be stored in many forms, ranging from simple text files to
tables in a database. The objective now is acquiring all the data you
need.

Video Content / Details of website for further learning:


 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
 Joel Grus, “Data Science from Scratch”, page nos. 403
 Thomas Nield, “Essential Math for Data Science”, page nos. 347
 Alex Gutman, “Becoming a Data Head”, page nos. 272

Course Faculty

Verified by HoD
LECTURE HANDOUTS L2

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : I - Introduction Date of Lecture:

Topic of Lecture: Data Science Process: Overview


In data science and big data you’ll come across many different types of data,
and each of them tends to require different tools and techniques. The main
categories of data are these:
 Structured
 Unstructured
 Natural language
 Machine-generated
 Graph-based
 Audio, video, and images
 Streaming

Prerequisite knowledge for Complete understanding and learning of Topic:

 Python
 Discrete mathematics

 Statistics
Detailed content of the Lecture:

Facets of Data:
Structured data

 Structured data is data that depends on a data model and resides in a
fixed field within a record. As such, it’s often easy to store structured
data in tables within databases or Excel files.
 SQL, or Structured Query Language, is the preferred way to manage
and query data that resides in databases.
 It follows a predefined data model.
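As a minimal sketch of querying structured data with SQL, the example below uses Python's built-in sqlite3 module with an in-memory database. The table name, columns, and rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name, country) VALUES (?, ?, ?)",
    [(1, "Asha", "IN"), (2, "Bruno", "BR"), (3, "Chen", "IN")],
)

# Every record has the same fixed fields, so a SQL query can filter on them.
rows = conn.execute(
    "SELECT name FROM customers WHERE country = ? ORDER BY name", ("IN",)
).fetchall()
names_in_india = [name for (name,) in rows]
print(names_in_india)
conn.close()
```

Because each record shares the same fields, the filter on country needs no text parsing; this is what makes structured data easy to store and query.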

Unstructured data

Unstructured data is data that isn’t easy to fit into a data model because
the content is context-specific or varying. One example of unstructured data
is your regular email.

Natural language

Natural language is a special type of unstructured data; it’s challenging
to process because it requires knowledge of specific data science
techniques and linguistics.
 The natural language processing community has had success in entity
recognition, topic recognition, summarization, text completion, and
sentiment analysis, but models trained in one domain don’t generalize
well to other domains.
 Even state-of-the-art techniques aren’t able to decipher the meaning of
every piece of text.
Machine-generated data
 Machine-generated data is information that’s automatically created
by a computer, process, application, or other machine without
human intervention.
 Machine-generated data is becoming a major data resource and will
continue to do so.

Graph-based or network data

“Graph data” can be a confusing term because any data can be shown in
a graph.
 Graph or network data is, in short, data that focuses on the relationship
or adjacency of objects.
 The graph structures use nodes, edges, and properties to represent and
store graphical data.
 Graph-based data is a natural way to represent social networks,
and its structure allows you to calculate specific metrics such as
the influence of a person and the shortest path between two people.
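The two metrics mentioned above can be sketched in plain Python. The friendship network below is hypothetical, the shortest path is found with a breadth-first search, and the number of direct connections is used only as a very crude stand-in for a person's influence (an assumption, not a definition from the text).

```python
from collections import deque

# Hypothetical friendship network stored as an adjacency list (node -> neighbours).
friends = {
    "Ann": ["Bob", "Cara"],
    "Bob": ["Ann", "Dev"],
    "Cara": ["Ann", "Dev"],
    "Dev": ["Bob", "Cara", "Eli"],
    "Eli": ["Dev"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Degree (number of direct connections) as a crude proxy for influence.
degree = {person: len(neighbours) for person, neighbours in friends.items()}
most_connected = max(degree, key=degree.get)

path = shortest_path(friends, "Ann", "Eli")
print(path, most_connected)
```

The nodes are people and the edges are friendships; both metrics are computed from the relationships alone, which is what distinguishes graph data from a plain table of records.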
Audio, image, and video

 Audio, image, and video are data types that pose specific challenges to a
data scientist.
 Tasks that are trivial for humans, such as recognizing objects in
pictures, turn out to be challenging for computers.
 MLBAM (Major League Baseball Advanced Media) announced in
2014 that they’ll increase video capture to approximately 7 TB per
game for the purpose of live, in-game analytics.
 Recently a company called DeepMind succeeded at creating an
algorithm that’s capable of learning how to play video games.
 This algorithm takes the video screen as input and learns to
interpret everything via a complex process of deep learning.

Video Content / Details of website for further learning:

 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:

 Joel Grus, “Data Science from Scratch”, page nos. 403
 Thomas Nield, “Essential Math for Data Science”, page nos. 347
 Alex Gutman, “Becoming a Data Head”, page nos. 272

Course Faculty

Verified by HoD
LECTURE HANDOUTS L3

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : I - Introduction Date of Lecture:

Topic of Lecture: Defining Research Goals


Introduction
Data science is an evolutionary extension of statistics capable of dealing with
the massive amounts of data produced today. It adds methods from computer
science to the repertoire of statistics.
Prerequisite knowledge for Complete understanding and learning of Topic:

 Python
 Discrete mathematics

 Statistics
Data Science Process:

Detailed content of the Lecture:

Overview of the data science process


The typical data science process consists of six steps through which you’ll
iterate:
1. The first step of this process is setting a research goal. The main
purpose here is making sure all the stakeholders understand the
what, how, and why of the project. In every serious project this will
result in a project charter.
2. The second phase is data retrieval. You want to have data available
for analysis, so this step includes finding suitable data and getting
access to the data from the data owner. The result is data in its raw
form, which probably needs polishing and transformation before it
becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes
transforming the data from a raw form into data that’s directly usable in
your models. To achieve this, you’ll detect and correct different kinds of
errors in the data, combine data from different data sources, and
transform it. If you have successfully completed this step, you can
progress to data visualization and modeling.

4. The fourth step is data exploration. The goal of this step is to gain a
deep understanding of the data. You’ll look for patterns,
correlations, and deviations based on visual and descriptive
techniques. The insights you gain from this phase will enable you to
start modeling.
5. Finally, we get to model building (often referred to as “data
modeling”). It is now that you attempt to gain
the insights or make the predictions stated in your project charter.
Now is the time to bring out the heavy guns, but remember research
has taught us that often (but not always) a combination of simple
models tends to outperform one complicated model. If you’ve done
this phase right, you’re almost done.
6. The last step of the data science process is presenting your results and
automating the analysis, if needed. One goal of a project is to change a
process and/or make better decisions. You may still need to convince the
business that your findings will indeed change the business process as
expected. This is where you can shine in your influencer role. The
importance of this step is more apparent in projects on a strategic and
tactical level. Certain projects require you to perform the business
process over and over again, so automating the project will save time.

Video Content / Details of website for further learning:

 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:

 Joel Grus, “Data Science from Scratch”, page nos. 403
 Thomas Nield, “Essential Math for Data Science”, page nos. 347
 Alex Gutman, “Becoming a Data Head”, page nos. 272

Course Faculty

Verified by HoD

LECTURE HANDOUTS L4

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : I - Introduction Date of Lecture:

Topic of Lecture: Retrieving Data


A project starts by understanding the what, the why, and the how of your
project. The outcome should be a clear research goal, a good understanding of
the context, well-defined deliverables, and a plan of action with a timetable. This
information is then best placed in a project charter.

Prerequisite knowledge for Complete understanding and learning of Topic:

 Python
 Discrete mathematics

 Statistics
Detailed content of the Lecture:

Defining research goals:

Spend time understanding the goals and context of your research

 An essential outcome is the research goal that states the purpose of
your assignment in a clear and focused manner.
 Understanding the business goals and context is critical for project
success.
 Continue asking questions and devising examples until you grasp
the exact business expectations, identify how your project fits in the
bigger picture, appreciate how your research is going to change the
business, and understand how they’ll use your results

Create a project charter


A project charter requires teamwork, and your input covers at least the
following:
 A clear research goal
 The project mission and context
 How you’re going to perform your analysis
 What resources you expect to use
 Proof that it’s an achievable project, or proof of concepts
 Deliverables and a measure of success

Retrieving data
 The next step in data science is to retrieve the required data.
Sometimes you need to go into the field and design a data
collection process yourself, but most of the time you won’t be
involved in this step.
 Many companies will have already collected and stored the data for
you, and what they don’t have can often be bought from third
parties.
 More and more organizations are making even high-quality data
freely available for public and commercial use.
 Data can be stored in many forms, ranging from simple text files to
tables in a database. The objective now is acquiring all the data you
need.
 Most companies have a program for maintaining key data, so much
of the cleaning work may already be done. This data can be stored
in official data repositories such as databases, data marts, data
warehouses, and data lakes maintained by a team of IT
professionals.
 Data warehouses and data marts are home to preprocessed data,
whereas data lakes contain data in its natural or raw format.
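As a small illustration of retrieving data from the simplest of these forms, a text file, the sketch below parses comma-separated text with Python's csv module. The file contents are hypothetical, and io.StringIO stands in for opening a real file on disk.

```python
import csv
import io

# Hypothetical contents of a simple text file; io.StringIO stands in for open("sales.csv").
raw_text = """region,amount
North,120
South,80
North,50
"""

reader = csv.DictReader(io.StringIO(raw_text))
# Values arrive as strings, so convert the amount to a number while loading.
records = [{"region": row["region"], "amount": int(row["amount"])} for row in reader]

# A first look at the retrieved data: total amount per region.
total_by_region = {}
for record in records:
    region = record["region"]
    total_by_region[region] = total_by_region.get(region, 0) + record["amount"]

print(total_by_region)
```

Data held in a database table would be retrieved with a SQL query instead, but the goal is the same: get the raw records into your program so that preparation can begin.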

External Data
 If data isn’t available inside your organization, look outside your
organization. Companies provide data so that you, in turn, can
enrich their services and ecosystem. Such is the case with
Twitter, LinkedIn, and Facebook.
 More and more governments and organizations share their data for free
with the world.
 A list of open data providers is a good place to get started.

Video Content / Details of website for further learning:

 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:

 Joel Grus, “Data Science from Scratch”, page nos. 403
 Thomas Nield, “Essential Math for Data Science”, page nos. 347
 Alex Gutman, “Becoming a Data Head”, page nos. 272

Course Faculty

Verified by HoD

LECTURE HANDOUTS L5

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : I - Introduction Date of Lecture:

Topic of Lecture: Data Preparation


Introduction :

Your model needs the data in a specific format, so data transformation will always come into play. It’s a
good habit to correct data errors as early on in the process as possible. However,
this isn’t always possible in a realistic setting, so you’ll need to take corrective
actions in your program.
Prerequisite knowledge for Complete understanding and learning of Topic:

 Python
 Discrete mathematics

 Statistics
Introduction to Data Preparation:

Your model needs the data in a specific format, so data transformation will
always come into play. It’s a good habit to correct data errors as early on in
the process as possible. However, this isn’t always possible in a realistic
setting, so you’ll need to take corrective actions in your program.
Cleansing data

Data cleansing is a subprocess of the data science process that focuses on
removing errors in your data so your data becomes a true and consistent
representation of the processes it originates from. At least two types of errors exist:
 The first type is the interpretation error, such as when you take the
value in your data for granted, like saying that a person’s age is greater
than 300 years.
 The second type of error points to inconsistencies between data
sources or against your company’s standardized values.
An example of this class of errors is putting “Female” in one table and “F”
in another when they represent the same thing: that the person is female.

Overview of common errors

Data Entry Errors

 Data collection and data entry are error-prone processes. They often
require human intervention, and humans make typos or lose concentration and so introduce errors into the chain.
 Data collected by machines or computers isn’t free from errors either. Some errors
arise from human sloppiness, whereas others are due to machine
or hardware failure.
 When the variables you study don’t have many classes, data errors
can be detected by tabulating the data with counts.
 When you have a variable that can take only two values, “Good” and
“Bad”, you can create a frequency table and see if those are truly the
only two values present. In the table, the values “Godo” and “Bade” point
out that something went wrong in at least 16 cases.

Such an error can be fixed with a simple rule:
if x == "Bade":
    x = "Bad"
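The frequency-table check described above can be sketched with collections.Counter. The sample column and the corrections mapping are hypothetical; the misspellings are the two mentioned in the text.

```python
from collections import Counter

# Hypothetical column that should contain only "Good" or "Bad".
quality = ["Good", "Bad", "Good", "Godo", "Bad", "Bade", "Good"]

# Frequency table: every distinct value with its count.
frequency = Counter(quality)
allowed = {"Good", "Bad"}
unexpected = {value: count for value, count in frequency.items() if value not in allowed}

# Map each known typo to its intended value and leave everything else unchanged.
corrections = {"Godo": "Good", "Bade": "Bad"}
cleaned = [corrections.get(value, value) for value in quality]

print(frequency)
print(unexpected)
```

Any key left in unexpected is a value that should not exist, so an empty dictionary after cleaning is a quick confirmation that the column is consistent.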

Redundant Whitespace
 Whitespaces tend to be hard to detect but cause errors like other redundant
characters would.
 Redundant whitespace causes mismatches between strings such as “FR ” and
“FR”, so that observations that can’t be matched are dropped.
 If you know to watch out for them, fixing redundant whitespaces is
luckily easy enough in most programming languages. They all provide
string functions that will remove the leading and trailing whitespaces.
For instance, in Python you can use the strip() function to remove
leading and trailing spaces.

Fixing Capital Letter Mismatches


Capital letter mismatches are common. Most programming languages make
a distinction between “Brazil” and “brazil”.
In this case you can solve the problem by applying a function that returns both
strings in lowercase, such as .lower() in Python:
"Brazil".lower() == "brazil".lower() evaluates to True.
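A minimal sketch combining both fixes, strip() for redundant whitespace and lower() for capital letter mismatches, is shown below. The key lists and the normalize helper are hypothetical names used only for illustration.

```python
# Hypothetical keys from two tables that should match but differ in whitespace and case.
table_a_keys = ["FR ", " Brazil", "IN"]
table_b_keys = ["FR", "brazil", "in"]


def normalize(key):
    """Remove leading/trailing whitespace and ignore capitalisation."""
    return key.strip().lower()


# Compare position by position, before and after cleaning.
raw_matches = sum(a == b for a, b in zip(table_a_keys, table_b_keys))
clean_matches = sum(
    normalize(a) == normalize(b) for a, b in zip(table_a_keys, table_b_keys)
)

print(raw_matches, clean_matches)
```

Applying the same normalization to both sides before matching keeps observations from being dropped because of formatting differences alone.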

Impossible Values and Sanity Checks


Here you check the value against physically or theoretically impossible values
such as people taller than 3 meters or someone with an age of 299 years.
Sanity checks can be directly expressed with rules:
check = 0 <= age <= 120
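The rule above can be applied to a whole column to list the values that fail the check. This is a small sketch; the ages and the helper name is_plausible_age are hypothetical.

```python
def is_plausible_age(age):
    """Rule from the text: an age must lie between 0 and 120 inclusive."""
    return 0 <= age <= 120


ages = [34, 299, -1, 120, 57]  # hypothetical observations

# Keep the values that violate the rule so they can be inspected or corrected.
flagged = [age for age in ages if not is_plausible_age(age)]

print(flagged)
```

Flagging instead of silently deleting lets you decide case by case whether a value is a typo that can be corrected or an observation that must be dropped.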

Outliers
An outlier is an observation that seems to be distant from other observations
or, more specifically, one observation that follows a different logic or
generative process than the other observations. The easiest way to find
outliers is to use a plot or a table with the minimum and maximum values.
The plot on the top shows no outliers, whereas the plot on the bottom shows
possible outliers on the upper side when a normal distribution is expected.
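A table of minimum and maximum values can be produced with a few lines of Python. The measurements below are hypothetical, and the 1.5 × IQR rule used to flag points is a common convention assumed here for illustration; it is not a method stated in the text.

```python
import statistics

values = [12, 14, 15, 15, 16, 17, 18, 95]  # hypothetical measurements

# Minimum/maximum summary: an extreme maximum far from the median hints at an outlier.
summary = {
    "min": min(values),
    "max": max(values),
    "median": statistics.median(values),
}

# Assumed rule (not from the text): flag points beyond 1.5 * IQR from the quartiles.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(summary)
print(outliers)
```

A flagged value is only a candidate: whether it is an error or a genuine extreme observation still has to be judged from the process that generated the data.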

Joining Tables
 Joining tables allows you to combine the information of one observation
found in one table with the information that you find in another table.
The focus is on enriching a single observation.
 Let’s say that the first table contains information about the purchases
of a customer and the other table contains information about the region
where your customer lives.
 Joining the tables allows you to combine the information so that you
can use it for your model, as shown in figure.
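The purchases-and-regions example can be sketched in plain Python as a lookup on a shared key. The field names and values are hypothetical; in practice the same join is usually done with a database query or a data-frame library.

```python
# Hypothetical tables: purchases per customer and the region each customer lives in.
purchases = [
    {"customer_id": 1, "amount": 250},
    {"customer_id": 2, "amount": 90},
    {"customer_id": 1, "amount": 40},
]
regions = {1: "North", 2: "South"}  # customer_id -> region

# Enrich every purchase with the region looked up by the shared key.
joined = [
    {**purchase, "region": regions.get(purchase["customer_id"])}
    for purchase in purchases
]

for row in joined:
    print(row)
```

Each purchase keeps its own fields and gains the region of its customer; a customer missing from the regions table would get None instead of being silently dropped.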

Transforming data

Certain models require their data to be in a certain shape, so you transform your
data into a form that is suitable for data modeling.

Relationships between an input variable and an output variable aren’t always linear. Take, for instance,
a relationship of the form y = a·e^(bx). Taking the logarithm of y gives ln y = ln a + bx, which is linear in x
and simplifies the estimation problem dramatically. Other times you might want to combine
two variables into a new variable.
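To see why a log transformation helps, consider y = a·e^(bx): taking the natural logarithm of y gives ln y = ln a + b·x, a straight line in x that ordinary least squares can fit. The sketch below uses hypothetical noise-free data generated with a = 2 and b = 0.5 and recovers both values.

```python
import math

# Hypothetical data generated from y = 2 * exp(0.5 * x); a and b are what we try to recover.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]

# Transform the output: ln(y) = ln(a) + b * x is a straight line in x.
log_ys = [math.log(y) for y in ys]

# Ordinary least squares for the slope and intercept of that line.
n = len(xs)
mean_x = sum(xs) / n
mean_ly = sum(log_ys) / n
slope = sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(xs, log_ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_ly - slope * mean_x

b_hat = slope
a_hat = math.exp(intercept)
print(a_hat, b_hat)
```

With real, noisy data the recovered values would only approximate a and b, but the transformed problem remains a simple linear fit.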

Reducing the Number of Variables


 Having too many variables in your model makes the model difficult to
handle, and certain techniques don’t perform well when you overload
them with too many input variables. For instance, all the techniques
based on a Euclidean distance perform well only up to 10 variables.
 Data scientists use special methods to reduce the number of variables
but retain the maximum amount of information.

Turning Variables into Dummies

 Dummy variables can only take two values: true (1) or false (0). They’re
used to indicate the presence or absence of a categorical effect that may explain
the observation.
 In this case you’ll make separate columns for the classes stored in one
variable and indicate it with 1 if the class is present and 0 otherwise.
 An example is turning one column named Weekdays into the columns
Monday through Sunday. You use an indicator to show if the
observation was on a Monday; you put 1 on Monday and 0 elsewhere.
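
The Weekdays example can be sketched as follows. The observations and the helper to_dummies are hypothetical; the point is only to show one column being spread into seven 1/0 columns.

```python
# Hypothetical observations with a single categorical column.
weekdays = ["Monday", "Wednesday", "Monday", "Sunday"]

ALL_DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def to_dummies(values, categories):
    """Return one dict per value with a 1/0 indicator for every category."""
    return [{cat: int(value == cat) for cat in categories} for value in values]


dummies = to_dummies(weekdays, ALL_DAYS)
print(dummies[0])
```

Each row has exactly one column set to 1, which is the defining property of this encoding.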
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]

Important Books/Journals for further learning including the page nos.:


 Data Science from Scratch, author: Joel Grus, page nos.: 403
 Essential Math for Data Science, author: Thomas Nield, page nos.: 347
 Becoming a Data Head, author: Alex Gutman, page nos.: 272

Course Faculty

Verified by HoD

LECTURE HANDOUTS L6

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : I - Introduction Date of Lecture:

Topic of Lecture: Exploratory data analysis


Introduction

During exploratory data analysis you take a deep dive into the data (see figure below).
Information becomes much easier to grasp when shown in a picture, therefore you
mainly use graphical techniques to gain an understanding of your data and the
interactions between variables.

Prerequisite knowledge for Complete understanding and learning of Topic:

 Python
 Discrete mathematics

 Statistics
Detailed content of the Lecture:
Exploratory data analysis

During exploratory data analysis you take a deep dive into the data (see figure
below). Information becomes much easier to grasp when shown in a picture,
therefore you mainly use graphical techniques to gain an understanding of your data
and the interactions between variables.

The goal isn’t to cleanse the data, but it’s common that you’ll still discover
anomalies you missed before, forcing you to take a step back and fix them.
EDA is not identical to statistical graphics, although the two terms are used
almost interchangeably. Statistical graphics is a collection of techniques, all
graphically based and all focusing on one data characterization aspect. EDA
encompasses a larger venue; EDA is an approach to data analysis that
postpones the usual assumptions about what kind of model the data follow in favor of
the more direct approach of allowing the data itself to reveal its underlying
structure and model. EDA is not a mere collection of techniques; EDA is a
philosophy as to how we dissect a data set; what we look for; how we look; and
how we interpret. It is true that EDA heavily uses the collection of techniques
that we call "statistical graphics", but it is not identical to statistical
graphics.

 The visualization techniques you use in this phase range from
simple line graphs or histograms, as shown in the figure below, to
more complex diagrams such as Sankey and network graphs.
 Sometimes it’s useful to compose a composite graph from simple
graphs to get even more insight into the data. Other times the
graphs can be animated or made interactive to make it easier
and, let’s admit it, way more fun.

The techniques we described in this phase are mainly visual, but in
practice they’re certainly not limited to visualization techniques.
Tabulation, clustering, and other modeling techniques can also be a part of
exploratory analysis. Even building simple models can be a part of this
step.
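
As a rough illustration of tabulation and of a histogram, the sketch below counts hypothetical exam scores per bin and draws the counts with text characters. The scores and the bin width of 10 are assumptions; a plotting library would normally draw the picture.

```python
from collections import Counter

# Hypothetical exam scores used only for illustration.
scores = [52, 58, 61, 64, 65, 67, 68, 70, 71, 71, 73, 75, 78, 82, 95]

BIN_WIDTH = 10  # assumed bin size

# Tabulation: count how many scores fall into each 10-point bin.
bins = Counter((s // BIN_WIDTH) * BIN_WIDTH for s in scores)

# A histogram drawn with characters, one row per bin.
for start in sorted(bins):
    print(f"{start:3d}-{start + BIN_WIDTH - 1:3d} | {'#' * bins[start]}")
```

Even this crude picture shows where the scores concentrate and that the single score in the 90s sits apart from the rest.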
Classical estimation techniques have the characteristic of taking all of the
data and mapping the data into a few numbers ("estimates"). This is both a
virtue and a vice. The virtue is that these few numbers focus on important
characteristics (location, variation, etc.) of the population. The vice is that
concentrating on these few characteristics can filter out other characteristics
(skewness, tail length, autocorrelation, etc.) of the same population. In this
sense there is a loss of information due to this "filtering" process.
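
The loss of information can be made concrete with a small sketch. The two samples are hypothetical and constructed so that one is the mirror image of the other around the mean; the helper skewness is written here for illustration.

```python
import statistics


def skewness(data):
    """Population skewness: the mean of the cubed z-scores."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return sum(((x - mu) / sigma) ** 3 for x in data) / len(data)


# Hypothetical samples: the second is the first mirrored around its mean (2).
right_tailed = [0, 0, 0, 0, 10]
left_tailed = [4, 4, 4, 4, -6]

for name, data in [("right", right_tailed), ("left", left_tailed)]:
    print(name, statistics.fmean(data), statistics.pstdev(data),
          round(skewness(data), 3))
```

Both samples have the same mean and the same standard deviation, so the two classical estimates cannot tell them apart, yet their skewness has opposite sign.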
The "good news" of the classical approach is that tests based on classical
techniques are usually very sensitive--that is, if a true shift in location, say, has
occurred, such tests frequently have the power to detect such a shift and to
conclude that such a shift is "statistically significant". The "bad news" is that
classical tests depend on underlying assumptions (e.g., normality), and hence
the validity of the test conclusions becomes dependent on the validity of the
underlying assumptions. Worse yet, the exact underlying assumptions may be
unknown to the analyst, or if known, untested. Thus the validity of the scientific
conclusions becomes intrinsically linked to the validity of the underlying
assumptions. In practice, if such assumptions are unknown or untested, the
validity of the scientific conclusions becomes suspect.
Video Content / Details of website for further learning:

 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:

 Data Science from Scratch, author: Joel Grus, page nos.: 403
 Essential Math for Data Science, author: Thomas Nield, page nos.: 347
 Becoming a Data Head, author: Alex Gutman, page nos.: 272

Course Faculty

Verified by HoD
LECTURE HANDOUTS L7

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : I - Introduction Date of Lecture:

Topic of Lecture: Build the model – Presenting findings and building Applications
Introduction :

The principle here is simple: the model should work on unseen data. You
use only a fraction of your data to estimate the model and the other part, the
holdout sample, is kept out of the equation.

Prerequisite knowledge for Complete understanding and learning of Topic:

 Python
 Discrete mathematics

 Statistics
Detailed content of the Lecture:

Model Building:

 With clean data in place and a good understanding of the content,
you’re ready to build models with the goal of making better
predictions, classifying objects, or gaining an understanding of the
system that you’re modeling.
 This phase is much more focused than the exploratory analysis
step, because you know what you’re looking for and what you
want the outcome to be.

Building a model is an iterative process. The way you build your model
depends on whether you go with classic statistics or the somewhat more
recent machine learning school, and the type of technique you want to
use. Either way, most models consist of the following main steps:
 Selection of a modeling technique and variables to enter in the model
 Execution of the model
 Diagnosis and model comparison
Model and variable selection
You’ll need to select the variables you want to include in your model and
a modeling technique. You’ll need to consider model performance and
whether your project meets all the requirements to use your model, as
well as other factors:
 Must the model be moved to a production environment and, if so,
would it be easy to implement?
 How difficult is the maintenance on the model: how long will it remain
relevant if left untouched?
 Does the model need to be easy to explain?
Model execution
 Once you’ve chosen a model you’ll need to implement it in code.
 Most programming languages, such as Python, already have
libraries such as StatsModels or Scikit-learn. These packages implement
several of the most popular techniques.
 Coding a model is a nontrivial task in most cases, so having these
libraries available can speed up the process. As you can see in the
following code, it’s fairly easy to use linear regression with
StatsModels or Scikit-learn.
 Doing this yourself would require much more effort even for the
simple techniques. The following listing shows the execution of a
linear prediction model.

 You’ll be building multiple models from which you then choose
the best one based on multiple criteria. Working with a holdout
sample helps you pick the best-performing model.
 A holdout sample is a part of the data you leave out of the model
building so it can be used to evaluate the model afterward.
 The principle here is simple: the model should work on unseen
data. You use only a fraction of your data to estimate the model
and the other part, the holdout sample, is kept out of the equation.
 The model is then unleashed on the unseen data and error measures
are calculated to evaluate it.

 Multiple error measures are available, and the figure shows the
general idea on comparing models. The error measure used in the
example is the mean square error.
Formula for mean square error: $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$

Mean square error is a simple measure: check for every prediction how
far it was from the truth, square this error, add up the error of every
prediction, and divide by the number of predictions.
 To estimate the models, we use 800 randomly chosen
observations out of 1,000 (or 80%), without showing the other
20% of the data to the model.
 Once the model is trained, we predict the values for the other 20%
of the observations, for which we already know the true
value, and calculate the model error with an error measure.
 Then we choose the model with the lowest error. In this example
we chose model 1 because it has the lowest total error.
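
The holdout procedure described above can be sketched as follows. Everything in the block is an assumption made for illustration: the data are simulated, the two candidate models (a fitted straight line and a constant mean) are stand-ins for whatever models are being compared, and the least-squares fit is written by hand instead of using StatsModels or Scikit-learn.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: y depends linearly on x, plus noise.
data = []
for _ in range(1000):
    x = random.uniform(0, 10)
    data.append((x, 3.0 * x + 5.0 + random.gauss(0, 1)))

# Keep 800 randomly chosen observations for training, 200 as the holdout.
random.shuffle(data)
train, holdout = data[:800], data[800:]

# Model 1: straight line fitted by least squares on the training part only.
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
         / sum((x - mean_x) ** 2 for x, _ in train))
intercept = mean_y - slope * mean_x

# Model 2: ignore x and always predict the training mean.
models = {
    "model 1 (line)": lambda x: slope * x + intercept,
    "model 2 (mean only)": lambda x: mean_y,
}


def mse(predict, rows):
    """Mean square error of predict() over (x, y) rows."""
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)


# Evaluate every model on the unseen holdout sample and pick the best.
errors = {name: mse(f, holdout) for name, f in models.items()}
best = min(errors, key=errors.get)
print(errors, best)
```

Because both models are scored only on observations they never saw during fitting, the comparison reflects how they would behave on new data.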

Many models make strong assumptions, such as independence of the inputs,
and you have to verify that these assumptions are indeed met. This is called
model diagnostics.
Presenting findings and building applications

 Sometimes people get so excited about your work that you’ll need
to repeat it over and over again because they value the
predictions of your models or the insights that you produced.
 This doesn’t always mean that you have to redo all of your analysis
all the time. Sometimes it’s sufficient that you implement only the
model scoring; other times you might build an application that
automatically updates reports, Excel spreadsheets, or PowerPoint
presentations. The last stage of the data science process is where
your soft skills will be most useful, and yes, they’re extremely
important.

Video Content / Details of website for further learning:

 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:

 Data Science from Scratch, author: Joel Grus, page nos.: 403
 Essential Math for Data Science, author: Thomas Nield, page nos.: 347
 Becoming a Data Head, author: Alex Gutman, page nos.: 272

Course Faculty
Verified by HoD

LECTURE HANDOUTS L8

CSE II/III
Course Name with Code : CS3352 & Foundations of Data Science
Course Faculty : [Link]
Unit : I - Introduction Date of Lecture:

Topic of Lecture: Data Mining and Data Warehousing


Introduction :
 Data mining is the process of discovering actionable information from
large sets of data. Data mining uses mathematical analysis to derive
patterns and trends that exist in data.
 Data warehousing is the process of constructing and using a data
warehouse. A data warehouse is constructed by integrating data from
multiple heterogeneous sources that support analytical reporting,
structured and/or ad hoc queries, and decision making.
Prerequisite knowledge for Complete understanding and learning of Topic:
 Database concepts

 SQL
Detailed content of the Lecture:

Data mining is the process of discovering actionable information from
large sets of data. Data mining uses mathematical analysis to derive
patterns and trends that exist in data. Typically, these patterns cannot be
discovered by traditional data exploration because the relationships are
too complex or because there is too much data.
These patterns and trends can be collected and defined as a data mining
model. Mining models can be applied to specific scenarios, such as:
 Forecasting: Estimating sales, predicting server loads or server
downtime
 Risk and probability: Choosing the best customers for targeted
mailings, determining the probable break-even point for risk
scenarios, assigning probabilities to diagnoses or other outcomes
 Recommendations: Determining which products are likely to
be sold together, generating recommendations
 Finding sequences: Analyzing customer selections in a shopping
cart, predicting next likely events
 Grouping: Separating customers or events into clusters of related
items, analyzing and predicting affinities
Building a mining model is part of a larger process that includes
everything from asking questions about the data and creating a model to
answer those questions, to deploying the model into a working
environment. This process can be defined by using the following six basic
steps:
1. Defining the Problem
2. Preparing Data
3. Exploring Data
4. Building Models
5. Exploring and Validating Models
6. Deploying and Updating Models
The following diagram describes the relationships between each step in the
process, and the technologies in Microsoft SQL Server that you can use to
complete each step.

Defining the Problem


The first step in the data mining process is to clearly define the problem, and
consider ways that data can be utilized to provide an answer to the problem.
This step includes analyzing business requirements, defining the scope of
the problem, defining the metrics by which the model will be evaluated,
and defining specific objectives for the data mining project. These tasks
translate into questions such as the following:
 What are you looking for? What types of relationships are you trying to
find?
 Does the problem you are trying to solve reflect the policies or
processes of the business?
 Do you want to make predictions from the data mining model, or
just look for interesting patterns and associations?
 Which outcome or attribute do you want to try to predict?
 What kind of data do you have and what kind of information is in
each column? If there are multiple tables, how are the tables
related? Do you need to perform any cleansing, aggregation, or
processing to make the data usable?
 How is the data distributed? Is the data seasonal? Does the data
accurately represent the processes of the business?

Preparing Data
 The second step in the data mining process is to consolidate and
clean the data that was identified in the Defining the Problem step.
 Data can be scattered across a company and stored in different
formats, or may contain inconsistencies such as incorrect or missing
entries.
 Data cleaning is not just about removing bad data or interpolating
missing values, but about finding hidden correlations in the data,
identifying sources of data that are the most accurate, and
determining which columns are the most appropriate for use in
analysis.
Exploring Data
Exploration techniques include calculating the minimum and maximum
values, calculating mean and standard deviations, and looking at the
distribution of the data. For example, you might determine by reviewing
the maximum, minimum, and mean values that the data is not
representative of your customers or business processes, and that you
therefore must obtain more balanced data or review the assumptions that
are the basis for your expectations. Standard deviations and other
distribution values can provide useful information about the stability and
accuracy of the results.
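
A minimal sketch of these exploration statistics, using Python's standard statistics module. The column of customer ages is hypothetical.

```python
import statistics

# Hypothetical column of customer ages pulled from a source table.
ages = [23, 25, 31, 35, 35, 41, 44, 52, 58, 67]

profile = {
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
}
for name, value in profile.items():
    print(f"{name:>5}: {value:.2f}")
```

If the known customer base is mostly over 60, for example, a profile like this one would suggest that the extracted data is not representative.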
Building Models
The mining structure is linked to the source of data, but does not actually
contain any data until you process it. When you process the mining
structure, SQL Server Analysis Services generates aggregates and other
statistical information that can be used for analysis. This information can
be used by any mining model that is based on the structure.
Exploring and Validating Models
Before you deploy a model into a production environment, you will want to
test how well the model performs. Also, when you build a model, you
typically create multiple models with different configurations and test all
models to see which yields the best results for your problem and your
data.
Deploying and Updating Models
After the mining models exist in a production environment, you can
perform many tasks, depending on your needs. The following are some of
the tasks you can perform:
 Use the models to create predictions, which you can then use to
make business decisions.
 Create content queries to retrieve statistics, rules, or formulas from the
model.
 Embed data mining functionality directly into an application. You
can include Analysis Management Objects (AMO), which contains a
set of objects that your application can use to create, alter, process,
and delete mining structures and mining models.
 Use Integration Services to create a package in which a mining
model is used to intelligently separate incoming data into multiple
tables.
 Create a report that lets users directly query against an existing mining
model.
Data warehousing
Data warehousing is the process of constructing and using a data
warehouse. A data warehouse is constructed by integrating data from
multiple heterogeneous sources that support analytical reporting,
structured and/or ad hoc queries, and decision making. Data warehousing
involves data cleaning, data integration, and data consolidations.

Characteristics of data warehouse


The main characteristics of a data warehouse are as follows:
 Subject-Oriented
A data warehouse is subject-oriented since it provides topic-wise
information rather than the overall processes of a business. Such
subjects may be sales, promotion, inventory, etc.
 Integrated
A data warehouse is developed by integrating data from varied
sources into a consistent format. The data must be stored in the
warehouse in a consistent and universally acceptable manner in
terms of naming, format, and coding. This facilitates effective data
analysis.
Database vs. Data Warehouse
Although a data warehouse and a traditional database share some
similarities, they are not the same thing. The main difference is that in a
database, data is collected for multiple transactional purposes. However, in a
data warehouse, data is collected on an extensive scale to perform analytics.
Databases provide real-time data, while warehouses store data to be accessed
for big analytical queries.
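
The difference in query style can be hinted at with a small sketch using Python's built-in sqlite3 module. The sales table and its columns are hypothetical, and a single in-memory SQLite table stands in for both systems here; a real warehouse would be a separate, much larger store.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("North", 120.0), ("South", 80.0), ("North", 200.0), ("South", 50.0)],
)

# Transactional style: fetch one specific record.
one_sale = conn.execute(
    "SELECT region, amount FROM sales WHERE id = ?", (3,)
).fetchone()

# Analytical style: aggregate over all records, grouped by subject.
totals = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall())

print(one_sale)
print(totals)
conn.close()
```

The first query touches a single row, as a transactional system does many times per second; the second scans the whole table to answer a question about the business.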
Data Warehouse Architecture
Usually, data warehouse architecture comprises a three-tier structure.
Bottom Tier
The bottom tier or data warehouse server usually represents a relational
database system. Back-end tools are used to cleanse, transform and feed
data into this layer.
Middle Tier
The middle tier represents an OLAP server that can be implemented in two
ways.
The ROLAP or Relational OLAP model is an extended relational database
management system that maps multidimensional data process to standard
relational process.
The MOLAP or multidimensional OLAP directly acts on multidimensional data
and operations.
Top Tier
This is the front-end client interface that gets data out from the data
warehouse. It holds various tools like query tools, analysis tools, reporting
tools, and data mining tools.
How Data Warehouse Works
Data Warehousing integrates data and information collected from various
sources into one comprehensive database. For example, a data warehouse
might combine customer information from an organization’s point-of-sale
systems, its mailing lists, website, and comment cards. It might also
incorporate confidential information about employees, salary information,
etc. Businesses use such components of data warehouse to analyze
customers.
Types of Data Warehouse
There are three main types of data warehouse.
Enterprise Data Warehouse (EDW)
This type of warehouse serves as a key or central database that facilitates
decision-support services throughout the enterprise. The advantage to this
type of warehouse is that it provides access to cross-organizational
information, offers a unified approach to data representation, and allows
running complex queries.
Operational Data Store (ODS)
This type of data warehouse refreshes in real-time. It is often preferred for
routine activities like storing employee records. It is required when data
warehouse systems do not support reporting needs of the business.
Data Mart
A data mart is a subset of a data warehouse built to maintain a particular
department, region, or business unit. Every department of a business has
a central repository or data mart to store data. The data from the data
mart is stored in the ODS periodically. The ODS then sends the data to the
EDW, where it is stored and used.
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
 Data Science from Scratch, author: Joel Grus, page nos.: 403
 Essential Math for Data Science, author: Thomas Nield, page nos.: 347

Course Faculty

Verified by HoD

LECTURE HANDOUTS L9

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science


Course Faculty : [Link]
Unit : I - Introduction Date of Lecture:

Topic of Lecture: Basic Statistical Descriptions of data

Statistical Descriptions of data

Introduction :

Data consist of numbers, of course. But these numbers are fed into the
computer, not produced by it. These are numbers to be treated with considerable
respect, neither to be tampered with, nor subjected to a numerical process whose
character you do not completely understand. You are well advised to acquire a
reverence for data that is rather different from the “sporty” attitude that is
sometimes allowable, or even commendable, in other numerical tasks.
The analysis of data inevitably involves some trafficking with the field of
statistics, that gray area which is not quite a branch of mathematics and just as
surely not quite a branch of science. In the following sections, you will repeatedly
encounter the following paradigm:
 Apply some formula to the data to compute “a statistic”.
 Compute where the value of that statistic falls in a probability distribution
that is computed on the basis of some “null hypothesis”.
 If it falls in a very unlikely spot, way out on a tail of the distribution,
conclude that the null hypothesis is false for your data set.
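
One way to see the three steps of this paradigm in action is an exact permutation test, sketched below. The two groups of measurements are hypothetical, and the choice of a permutation test (rather than a classical test such as the t-test) is an assumption made to keep the example within the standard library.

```python
from itertools import combinations
from statistics import mean

# Hypothetical measurements for two groups.
group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 5.1]
group_b = [6.0, 6.2, 5.9, 6.1, 6.3, 6.0]

# Step 1: the statistic is the absolute difference in group means.
observed = abs(mean(group_b) - mean(group_a))

# Step 2: null hypothesis "the group labels do not matter". Build the null
# distribution by trying every way to label 6 of the 12 values as group A.
pooled = group_a + group_b
indices = range(len(pooled))
total = extreme = 0
for chosen in combinations(indices, len(group_a)):
    a = [pooled[i] for i in chosen]
    b = [pooled[i] for i in indices if i not in chosen]
    total += 1
    if abs(mean(b) - mean(a)) >= observed - 1e-12:
        extreme += 1

# Step 3: a tiny p-value means the observed statistic sits far out in the tail.
p_value = extreme / total
print(total, extreme, round(p_value, 4))
```

Only the original labelling and its mirror image produce a difference as large as the observed one, so the null hypothesis is rejected for this data set.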

Prerequisite knowledge for Complete understanding and learning of Topic:


 Database concepts
 SQL
Detailed content of the Lecture:
Statistics is a branch of mathematics which deals with numbers and data
analysis. Statistics is the study of the collection, analysis, interpretation,
presentation, and organization of data. Statistical theory defines a statistic as a
function of a sample where the function itself is independent of the sample’s
distribution.
In short, Statistics is associated with collecting, classifying, arranging and
presenting numerical data. It allows us to interpret various results from it and
forecast many possibilities. Statistics deals with facts, observations and information
which are in the form of numeric data only. With the help of statistics, we are able to
find various measures of central tendencies and the deviation of different values from
the center.
Random sampling— In experiments where a sample or subset of plants,
leaves, insects, etc. will be treated, counted, or observed for collecting data, the
sample taken should be random. If all the data are collected from one plant or
area of the habitat it will be biased and not be a good indicator of what is really
going on ecologically.
An example in everyday life would be a researcher wanting to
determine the average height of fifth graders in a particular school district. If she
only measured boys her results would only apply to boys, not all fifth graders,
and would thus be biased, not random. To collect unbiased data she would
randomly choose the same number of boys and girls from each fifth grade class
to measure. She could do this by assigning every child a number and then
pulling numbers from a hat. These days, there are simple computer programs to
do the picking. Here is an ecological example: The experimental design calls for
observing what food items red ants bring back to their colony as compared to
black ants. You have too many ant colonies to observe all of them, so you pick a
random sample of 5 colonies of each ant type to observe. An easy way to choose
randomly is by giving each colony a number or letter on a slip of paper. Put
these in a basket and pull 5 slips for each ant colony type. This way there is no
bias toward any particular colonies.
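
The basket of paper slips can be replaced by a few lines of Python. The colony labels and the count of 20 colonies per ant type are hypothetical; random.sample draws without replacement, just like pulling slips.

```python
import random

random.seed(42)  # fixed seed so the example is repeatable

# Hypothetical colony labels: 20 red-ant and 20 black-ant colonies.
red_colonies = [f"R{i}" for i in range(1, 21)]
black_colonies = [f"B{i}" for i in range(1, 21)]

# Draw 5 colonies of each type without replacement, like slips from a basket.
red_sample = random.sample(red_colonies, 5)
black_sample = random.sample(black_colonies, 5)

print(red_sample)
print(black_sample)
```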
BASIC STATISTICAL FORMULAS
The following is a very brief primer on statistics. The intention is to whet your
appetite and provide some familiarity with basic statistics. Any number of texts
exist that can explain these and other statistical methods in detail.
Averages (mean)— An average or mean is calculated as the sum of the numbers for one
group of treated organisms or plots of land (the experimental units), divided by
the number of organisms or plots. Example: You run an experiment on 10 plants,
with 5 plants being treated with nitrogen and 5 not receiving nitrogen. Each
organism or plot is an experimental unit, so in this case you have 5 replicates in
each treatment group.
Median— is a statistic of location. It is that value of the variable that has
an equal number of items on either side of it. For example, in the nitrogen
experiment above, the median plant height without nitrogen is found by first
ordering the heights, 4, 5, 7, 9, 10. The median is 7 because it has the same
number of observations above and below it. If there is an even number of
observations (e.g. 4, 5, 7, 9) the median is halfway between the two middle values; in this
case it would be 6 since that is halfway between 5 and 7. The median is
commonly used in describing household income. If the median income in an
area is $35,000, then half the households have incomes less than $35,000 and
half have incomes greater than $35,000.
Mode— is the value represented by the greatest number of individuals
in the sample. E.g. in a sample of twenty-five insects, five are beetles, seven are
flies, ten are spiders, and three are moths. The mode is the spiders with ten
individuals. The mode is the least used descriptive statistic.
Range— Range is a
measure of dispersion. It is the difference between the largest and smallest
items in a sample. The values in the table above for No Nitrogen run from
4 cm to 10 cm, a range of 6 cm. The range is a good indicator of variability in small data sets, but
as data sets get larger with many values, variance and standard deviation are
used to describe the dispersion around the mean.
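
The worked numbers above can be reproduced with Python's standard statistics module. The heights are the No Nitrogen values from the median example and the insect counts come from the mode example; the variable names are chosen here for illustration.

```python
import statistics

# Plant heights (cm) without nitrogen, as used in the median example.
heights = [4, 5, 7, 9, 10]

mean_height = statistics.mean(heights)          # (4 + 5 + 7 + 9 + 10) / 5
median_odd = statistics.median(heights)         # the middle value
median_even = statistics.median([4, 5, 7, 9])   # halfway between 5 and 7
value_range = max(heights) - min(heights)       # largest minus smallest

# Insect sample from the mode example: 5 + 7 + 10 + 3 = 25 individuals.
insects = ["beetle"] * 5 + ["fly"] * 7 + ["spider"] * 10 + ["moth"] * 3
most_common = statistics.mode(insects)

print(mean_height, median_odd, median_even, value_range, most_common)
```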
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
 Data Science from Scratch, author: Joel Grus, page nos.: 403
 Essential Math for Data Science, author: Thomas Nield, page nos.: 347

Course Faculty

Verified by HoD

LECTURE HANDOUTS L10

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : II – Describing Data Date of Lecture:

Topic of Lecture: Types of Data

Introduction :

Today data is everywhere in every field. Whether you are a data scientist,
marketer, businessman, data analyst, researcher, or you are in any other profession,
you need to play or experiment with raw or structured data. This data is so important
for us that it becomes important to handle and store it properly, without any error.
While working on these data, it is important to know the types of data to process them
and get the right results.

Prerequisite knowledge for Complete understanding and learning of Topic:


 DBMS

 SQL
Detailed content of the Lecture:

The data is classified into four categories:

 Nominal data.
 Ordinal data.
 Discrete data.
 Continuous data.
Now business runs on data, and most companies use data for their insights to create
and launch campaigns, design strategies, launch products and services or try out
different things. According to a report, today, at least 2.5 quintillion bytes of data are
produced per day.

Types of Data
Qualitative or Categorical Data
Qualitative or Categorical Data is data that can’t be measured or counted in the form
of numbers. These types of data are sorted by category, not by number. That’s why it
is also known as Categorical Data. These data consist of audio, images, symbols, or
text. The gender of a person, i.e., male, female, or others, is qualitative data.
Qualitative data tells about the perception of people. This data helps market
researchers understand the customers’ tastes and then design their ideas and
strategies accordingly.

The other examples of qualitative data are:


 What language do you speak
 Favorite holiday destination
 Opinion on something (agree, disagree, or neutral)
 Colors
The Qualitative data are further classified into two parts :
Nominal Data
Nominal Data is used to label variables without any order or quantitative value. The
color of hair can be considered nominal data, as one color can’t be ranked above
another color. The name “nominal” comes from the Latin word “nomen,” which means
“name.” With nominal data, we can’t do any arithmetic and can’t put
the values in a meaningful order; the values
are simply distributed into distinct categories.
Examples of Nominal Data :

 Colour of hair (Blonde, Red, Brown, Black, etc.)

 Marital status (Single, Widowed, Married)


 Nationality (Indian, German, American)
 Gender (Male, Female, Others)
 Eye Color (Black, Brown, etc.)
Ordinal Data
Ordinal data have a natural ordering: the values are ranked by
their position on a scale. These data are used for observations like customer
satisfaction, happiness, etc., but we can’t do any arithmetical tasks on them.
Ordinal data is qualitative data whose values have some kind of relative
position. These kinds of data can be considered “in-between” qualitative and
quantitative data. Ordinal data show only the sequence of the values, so statistics
that rely on arithmetic, such as the mean, are not meaningful for them. Compared to
nominal data, ordinal data have some kind of order
that is not present in nominal data.

Examples of Ordinal Data :


 When companies ask for feedback, experience, or satisfaction on a scale of 1 to
10
 Letter grades in the exam (A, B, C, D, etc.)
 Ranking of people in a competition (First, Second, Third, etc.)
 Economic Status (High, Medium, and Low)
 Education Level (Higher, Secondary, Primary)
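
The practical difference between the two kinds of qualitative data can be sketched as follows. The sample values are hypothetical, and the numeric ranks assigned to Low, Medium and High are an assumed encoding that only expresses order, not distance.

```python
from collections import Counter

# Nominal: labels only, so counting is meaningful but ranking is not.
hair_colors = ["Brown", "Black", "Brown", "Blonde", "Black", "Brown"]
color_counts = Counter(hair_colors)

# Ordinal: labels with a rank. The rank numbers are an assumed encoding.
RANK = {"Low": 0, "Medium": 1, "High": 2}
economic_status = ["High", "Low", "Medium", "Low", "High"]

ordered = sorted(economic_status, key=RANK.get)
middle = ordered[len(ordered) // 2]  # median category of an odd-length list

print(color_counts.most_common(1))
print(ordered, middle)
```

For the nominal column only the most frequent category can be reported; for the ordinal column the values can also be sorted and a middle category picked, but averaging the rank numbers would still not be meaningful.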
Difference between Nominal and Ordinal Data

Nominal Data | Ordinal Data

Nominal data can’t be quantified, nor do they have any intrinsic ordering | Ordinal data give some kind of sequential order by their position on the scale

Nominal data is qualitative data or categorical data | Ordinal data is said to be “in-between” qualitative data and quantitative data

They don’t provide any quantitative value, nor can we perform any arithmetical operation | They provide sequence, and numbers can be assigned to ordinal data, but arithmetical operations cannot be performed

Nominal data cannot be used to compare with one another | Ordinal data can help to compare one item with another by ranking or ordering

Examples: Eye color, housing style, gender, hair color, religion, marital status, ethnicity, etc. | Examples: Economic status, customer satisfaction, education level, letter grades, etc.
Discrete Data

The term discrete means distinct or separate. The discrete data contain the values
that fall under integers or whole numbers. The total number of students in a class is an
example of discrete data. These data can’t be broken into decimal or fraction values.
The discrete data are countable and have finite values; their subdivision is not
possible. These data are represented mainly by a bar graph, number line, or frequency
table.

Examples of Discrete Data :


 Total numbers of students present in a class
 Cost of a cell phone
 Numbers of employees in a company
 The total number of players who participated in a competition
 Days in a week

Continuous Data
Continuous data are in the form of fractional numbers. It can be the version of an
android phone, the height of a person, the length of an object, etc. Continuous data
represents information that can be divided into smaller levels. The continuous variable
can take any value within a range. The key difference between discrete and
continuous data is that discrete data contains integers or whole numbers, whereas
continuous data can take fractional values.
Examples of Continuous Data :

 Height of a person
 Speed of a vehicle
 “Time-taken” to finish the work
 Wi-Fi Frequency
Difference between Discrete and Continuous Data

Discrete Data | Continuous Data
Discrete data are countable and finite; they are whole numbers or integers. | Continuous data are measurable; they are in the form of fractions or decimals.
Discrete data are represented mainly by bar graphs. | Continuous data are represented in the form of a histogram.
The values cannot be subdivided into smaller pieces. | The values can be subdivided into smaller pieces.
Discrete data have spaces between the values. | Continuous data are in the form of a continuous sequence.
Examples: Total students in a class, number of days in a week, size of a shoe, etc. | Examples: Temperature of a room, the weight of a person, length of an object, etc.
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
 Data Science from Scratch, author: Joel Grus, page nos.: 403
 Essential Math for Data Science, author: Thomas Nield, page nos.: 347
 Becoming a Data Head, author: Alex Gutman, page nos.: 272

Course Faculty

Verified by HoD

LECTURE HANDOUTS L11

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science


Course Faculty : [Link]
Unit : II – Describing Data Date of Lecture:

Topic of Lecture: Types of Variables


Introduction :
A variable is a characteristic or property that can take on different values.
The weights can be described not only as quantitative data but also as
observations for a quantitative variable, since the various weights take on
different numerical values.
By the same token, the replies can be described as observations for a
qualitative variable, since the replies to the Facebook profile question take
on different values of either Yes or No.
Given this perspective, any single observation can be described as a
constant, since it takes on only one value.
Prerequisite knowledge for Complete understanding and learning of Topic:

 Software engineering concepts

 Mathematics
Detailed content of the Lecture:

TYPES OF VARIABLES

Discrete and Continuous Variables


Quantitative variables can be further distinguished as discrete or
continuous.

A discrete variable consists of isolated numbers separated by gaps.

Discrete variables can only assume specific values that you cannot
subdivide. Typically, you count discrete values, and the results are
integers.
Examples
 Counts- such as the number of children in a family. (1, 2, 3, etc., but
never 1.5)
 These variables cannot have fractional or decimal values. You can have
20 or 21 cats, but not 20.5
 The number of heads in a sequence of coin tosses.
 The result of rolling a die.
 The number of patients in a hospital.
 The population of a country.
While discrete variables have no decimal places, the average of these
values can be fractional. For example, families can have only a discrete
number of children: 1, 2, 3, etc. However, the average number of children
per family can be 2.2.
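The point about fractional averages can be checked with a short sketch. The family counts below are hypothetical values chosen only so that the average comes out to 2.2.

```python
from statistics import mean

# Hypothetical data: number of children in each of five families.
# Every value is a whole number, as required for a discrete variable.
children_per_family = [1, 2, 2, 3, 3]

average_children = mean(children_per_family)
print(average_children)  # 2.2, a fractional average of whole-number counts
```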

A continuous variable consists of numbers whose values, at least in theory, have no restrictions.

Continuous variables can assume any numeric value and can be meaningfully split into smaller parts. Consequently, they have valid fractional and decimal values. In fact, continuous variables have an infinite number of potential values between any two points. Generally, you measure them using a scale. Examples of continuous variables include weight, height, length, time, and temperature. Other examples are durations, such as the reaction times of grade school children to a fire alarm, and standardized test scores, such as those on the Scholastic Aptitude Test (SAT).
Independent Variable
In an experiment, an independent variable is the treatment manipulated by
the investigator.

 Independent variables (IVs) are the ones that you include in the
model to explain or predict changes in the dependent variable.
 Independent indicates that they stand alone and other variables in the
model do not influence them.
 Independent variables are also known as predictors, factors,
treatment variables, explanatory variables, input variables, x-
variables, and right-hand variables—because they appear on the
right side of the equals sign in a regression equation.
 It is a variable that stands alone and isn't changed by the other variables
you are trying to measure.
For example, someone's age might be an independent variable. Other factors (such as what they eat, how much they go to school, how much television they watch) are not going to change a person's age.
Dependent Variable
When a variable is believed to have been influenced by the independent
variable, it is called a dependent variable. In an experimental setting,
the dependent variable is measured, counted, or recorded by the
investigator.

 The dependent variable (DV) is what you want to use the model to
explain or predict. The values of this variable depend on other
variables.
 It’s also known as the response variable, outcome variable, and
left-hand variable. Graphs place dependent variables on the
vertical, or Y, axis.
 A dependent variable is exactly what it sounds like: something that depends on other factors.

For example, the result of a blood sugar test depends on what food you ate, at what time you ate it, and so on.
Unlike the independent variable, the dependent variable isn’t manipulated by the investigator. Instead, it represents an outcome: the data produced by the experiment.

Confounding variable

An uncontrolled variable that compromises the interpretation of a study is known as a confounding variable. Sometimes a confounding variable occurs because it’s impossible to assign subjects randomly to different conditions.

LECTURE HANDOUTS L12
CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science


Course Faculty : [Link]
Unit : II – Describing Data Date of Lecture:
Topic of Lecture: Describing Data with tables
Introduction :
When observations are sorted into classes of more than one value, the result is referred to as a frequency distribution for grouped data.

Prerequisite knowledge for Complete understanding and learning of Topic:

 Table structures
 SQL

 Database concepts
Detailed content of the Lecture:

Describing Data with Tables and Graphs


Frequency Distributions for Quantitative Data
 A frequency distribution is a collection of observations produced by sorting observations into classes and showing their frequency (f) of occurrence in each class.
 When observations are sorted into classes of single values, as in Table 2.1, the result is referred to as a frequency distribution for ungrouped data.
 The frequency distribution shown in Table 2.1 is only partially displayed because there are more than 100 possible values between the largest and smallest observations.
A frequency distribution for ungrouped data is most informative when the number of possible observed values is less than about 20. When there are more possible values, a frequency distribution for grouped data is used instead.
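A frequency distribution for ungrouped data can be sketched in a few lines. The scores below are hypothetical (they are not the observations in Table 2.1), and `Counter` is just one convenient way to tally them.

```python
from collections import Counter

# Hypothetical observations (not taken from Table 2.1).
scores = [3, 5, 4, 5, 2, 4, 5, 3, 4, 5]

# Each distinct value is its own class; f is how often that value occurs.
frequency = Counter(scores)

print("value  f")
for value in sorted(frequency, reverse=True):
    print(f"{value:>5}  {frequency[value]}")
print(f"total  {sum(frequency.values())}")
```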

Grouped Data
When observations are sorted into classes of more than one value, the result is referred to as a frequency distribution for grouped data.

 In the general structure of this frequency distribution, the data are grouped into class intervals with 10 possible values each.
 The frequency (f) column shows the frequency of observations in each class and, at the bottom, the total number of observations in all classes.
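Grouping into class intervals of 10 possible values can be sketched as follows. The weights are hypothetical, and the helper name `class_interval` is introduced here only for illustration.

```python
from collections import Counter

# Hypothetical observations, e.g. weights in pounds (not the handout's data).
weights = [160, 193, 226, 152, 180, 205, 163, 157, 151, 157, 220, 145, 158, 172, 165]

WIDTH = 10  # each class interval covers 10 possible values, e.g. 150-159


def class_interval(x, width=WIDTH):
    """Return the (lower, upper) limits of the class interval that contains x."""
    lower = (x // width) * width
    return lower, lower + width - 1


# Tally how many observations fall in each class interval.
frequency = Counter(class_interval(w) for w in weights)

for (lower, upper), f in sorted(frequency.items(), reverse=True):
    print(f"{lower}-{upper}  {f}")
print(f"Total    {sum(frequency.values())}")
```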

LECTURE HANDOUTS L13

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science


Course Faculty : [Link]
Unit : II – Describing Data Date of Lecture:
Topic of Lecture: Describing Data with Graphs

Introduction :
A graph is a visual representation of numerical data. Graphs provide a visual way to
summarize complex data and to show the relationship between different variables or
sets of data. Graphs are also an excellent way to demonstrate trends and
relationships within the data.
Prerequisite knowledge for Complete understanding and learning of Topic:

 Data structures

 DBMS
Detailed content of the Lecture:

Data can be described clearly and concisely with the aid of a well-constructed
frequency distribution. And data can often be described even more vividly by
converting frequency distributions into graphs.

GRAPHS FOR QUANTITATIVE DATA

Histograms
A bar-type graph for quantitative data. The common boundaries between
adjacent bars emphasize the continuity of the data, as with continuous variables.
A histogram is a display of statistical information that uses rectangles to show the
frequency of data items in successive numerical intervals of equal size.

Important features of histograms


 Equal units along the horizontal axis (the X axis, or abscissa) reflect the
various class intervals of the frequency distribution.
 Equal units along the vertical axis (the Y axis, or ordinate) reflect
increases in frequency. (The units along the vertical axis do not have to
be the same width as those along the horizontal axis.)
 The intersection of the two axes defines the origin at which both
numerical scales equal 0.
 Numerical scales always increase from left to right along the horizontal
axis and from bottom to top along the vertical axis
 The body of the histogram consists of a series of bars whose heights
reflect the frequencies for the various classes.
 The adjacent bars in histograms have common boundaries that
emphasize the continuity of quantitative data for continuous variables.
 The introduction of gaps between adjacent bars would suggest an
artificial disruption in the data more appropriate for discrete quantitative
variables or for qualitative variables.
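The features listed above can be illustrated with a rough text rendering of a histogram. This is only a sketch: the class intervals and frequencies are hypothetical, and a plotting library would normally be used instead of printed characters.

```python
# Hypothetical grouped frequency distribution: (class interval, frequency f).
classes = [("140-149", 1), ("150-159", 5), ("160-169", 3), ("170-179", 2), ("180-189", 1)]

BAR_WIDTH = 8
tallest = max(f for _, f in classes)

rows = []
# Build the chart from the highest frequency down to 1 (the vertical axis).
for level in range(tallest, 0, -1):
    # Adjacent bars touch: each class fills its full width with no gap.
    cells = ["#" * BAR_WIDTH if f >= level else " " * BAR_WIDTH for _, f in classes]
    rows.append(f"{level:>2} |" + "".join(cells))
# Horizontal axis with the class intervals increasing from left to right.
rows.append("   +" + "-" * (BAR_WIDTH * len(classes)))
rows.append("    " + "".join(f"{label:<{BAR_WIDTH}}" for label, _ in classes))

print("\n".join(rows))
```

Bar heights reflect the frequencies, and the shared boundaries between neighbouring bars mirror the continuity of the underlying variable.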
Using Graphs and Charts to Illustrate Quantitative Data

Using visual representations to present data from Indicators for School Health, (SLIMS), surveys, or other evaluation activities makes them easier to understand. Bar graphs, pie charts, line graphs, and histograms are an excellent way to illustrate your program results.

Types of Graphs and Charts
• A bar graph is composed of discrete bars that represent different categories of data. The length or height of the bar is equal to the quantity within that category of data. Bar graphs are best used to compare values across categories.

LECTURE HANDOUTS L14

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : II – Describing Data Date of Lecture:

Topic of Lecture: Describing data with averages

Introduction :
The mode reflects the value of the most frequently occurring score. The median reflects the middle value when observations are ordered from least to most. The range is the difference between the largest and smallest scores.
Prerequisite knowledge for Complete understanding and learning of Topic:

 Statistics

 averages
Detailed content of the Lecture:
Describing Data with Averages
MODE
The mode reflects the value of the most frequently occurring score. In other words, the mode is the value that has the highest frequency in a given set of values; it is the value that appears the greatest number of times.
Example:
In the given set of data: 2, 4, 5, 5, 6, 7, the mode of the data set is 5, since it appears in the set twice and every other value appears only once.
Types of Modes
Bimodal, Trimodal & Multimodal (More than one mode)

 When there are two modes in a data set, the set is called bimodal.
For example, the modes of set A = {2,2,2,3,4,4,5,5,5} are 2 and 5, because both 2 and 5 are repeated three times in the given set.
 When there are three modes in a data set, the set is called trimodal.
For example, the modes of set A = {2,2,2,3,4,4,5,5,5,7,8,8,8} are 2, 5 and 8.
 When there are four or more modes in a data set, the set is called multimodal.

Example: The following table represents the number of wickets taken by a bowler in 10 matches. Find the mode of the given set of data.

It can be seen that 2 wickets were taken by the bowler most frequently in different matches. Hence, the mode of the given data is 2.
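The mode examples above can be reproduced with Python's standard `statistics` module. This is a small sketch using the same data sets as the text; `multimode` (available from Python 3.8) returns every value that shares the highest frequency.

```python
from statistics import mode, multimode

data = [2, 4, 5, 5, 6, 7]
single_mode = mode(data)
print(single_mode)   # 5

set_a = [2, 2, 2, 3, 4, 4, 5, 5, 5]
modes_a = multimode(set_a)
print(modes_a)       # [2, 5], so the set is bimodal

set_b = [2, 2, 2, 3, 4, 4, 5, 5, 5, 7, 8, 8, 8]
modes_b = multimode(set_b)
print(modes_b)       # [2, 5, 8], so the set is trimodal
```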
MEDIAN
The median reflects the middle value when observations are ordered from least
to most.
The median splits a set of ordered observations into two equal parts, the upper
and lower halves.

Finding the Median

 Order scores from least to most.
 If the total number of observations n is odd, then the formula to calculate the median is: Median = {(n+1)/2}th term (observation)
 If the total number of observations n is even, then the median formula is: Median = 1/2[(n/2)th term + {(n/2)+1}th term]

Example 1:

Find the median of the following:


4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29
Solution:
n = 15
When we put those numbers in order we have:
4, 12, 14, 17, 22, 23, 23, 24, 25, 29, 40, 67, 77, 82, 92
Median = {(n+1)/2}th term
= {(15+1)/2}th term
= 8th term
The 8th term in the list is 24.
The median value of this set of numbers is 24.

Example 2:
Find the median of the following:
9,7,2,11,18,12,6,4
Solution:
n = 8
When we put those numbers in order we have:
2, 4, 6, 7, 9, 11, 12, 18
Median = 1/2[(n/2)th term + {(n/2)+1}th term]
= 1/2[(8/2)th term + {(8/2)+1}th term]
= 1/2[4th term + 5th term] (in our list the 4th term is 7 and the 5th term is 9)
= 1/2[7 + 9]
= 1/2(16)
= 8
The median value of this set of numbers is 8.
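The two median rules can be sketched directly in code. The function name `median_by_position` is introduced here for illustration; the built-in `statistics.median` gives the same results and is what one would normally use.

```python
from statistics import median


def median_by_position(values):
    """Median via the {(n+1)/2}th term (odd n) or the two middle terms (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the {(n+1)/2}th term, converted to a 0-based index.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the (n/2)th and {(n/2)+1}th terms.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


example_1 = [4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29]
example_2 = [9, 7, 2, 11, 18, 12, 6, 4]

print(median_by_position(example_1))  # 24
print(median_by_position(example_2))  # 8.0
```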
MEAN
The mean is found by adding all scores and then dividing by the number of
scores.
Mean is the average of the given numbers and is calculated by dividing the
sum of given numbers by the total number of numbers.

Types of means
 Sample mean
 Population mean

Sample Mean
The sample mean is a measure of central tendency. It is the arithmetic average computed from a sample, that is, from values drawn at random from the population. It is evaluated as the sum of all the sample values divided by the total number of values in the sample.

Population Mean

The population mean is calculated as the sum of all values in the given population divided by the total number of values in the population.
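The difference between the two means can be illustrated with a minimal sketch. The population below is randomly generated and purely hypothetical; the sample size of 25 is likewise an arbitrary choice.

```python
import random
from statistics import mean

# Hypothetical population of 1,000 scores (illustrative values only).
random.seed(0)
population = [random.randint(40, 100) for _ in range(1000)]

# Population mean: uses every value in the population.
population_mean = sum(population) / len(population)

# Sample mean: uses only the values drawn at random from the population.
sample = random.sample(population, 25)
sample_mean = sum(sample) / len(sample)

print(population_mean, sample_mean)
```

The sample mean will usually be close to, but not exactly equal to, the population mean.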

AVERAGES FOR QUALITATIVE AND RANKED DATA


Mode
The mode always can be used with qualitative data.
Median
The median can be used whenever it is possible to order qualitative data from least
to most because the level of measurement is ordinal
LECTURE HANDOUTS L15

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : II – Describing Data Date of Lecture:


Topic of Lecture: Describing Variability

Introduction :

The range is the difference between the largest and smallest scores.

Prerequisite knowledge for Complete understanding and learning of Topic:

 Statistics

 averages
Detailed content of the Lecture:

RANGE

The range is the difference between the largest and smallest scores.
The range in statistics for a given data set is the difference between the
highest and lowest values. For example, if the given data set is {2,5,8,10,3},
then the range will be 10 – 2 = 8.

Example 1: Find the range of given observations: 32, 41, 28, 54, 35, 26, 23, 33, 38, 40.
Solution: Let us first arrange the given values in ascending order.
23, 26, 28, 32, 33, 35, 38, 40, 41, 54
Since 23 is the lowest value and 54 is the highest value, the range of the observations will be:
Range (X) = Max (X) – Min (X)
= 54 – 23
= 31
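Both range calculations above can be checked with a couple of lines of Python:

```python
observations = [32, 41, 28, 54, 35, 26, 23, 33, 38, 40]
data_range = max(observations) - min(observations)
print(data_range)   # 31

small_set = [2, 5, 8, 10, 3]
small_range = max(small_set) - min(small_set)
print(small_range)  # 8
```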

VARIANCE
Variance is a measure of how data points differ from the mean. A variance
is a measure of how far a set of data (numbers) are spread out from their
mean (average) value.

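Formula
\[ \sigma^2 = \frac{\sum (x - \mu)^2}{n}, \qquad \sigma = \sqrt{\sigma^2} \]
That is, variance = (standard deviation)². The variance is itself a type of mean, the mean of all squared deviations: the squared deviations are added and then divided by the total number of scores.

Example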


X = 5, 8, 6, 10, 12, 9, 11, 10, 12, 7
Solution:
n = 10
Sum(x) = 5 + 8 + 6 + 10 + 12 + 9 + 11 + 10 + 12 + 7 = 90
Mean μ = Sum(x) / n = 90 / 10 = 9
Deviation from mean:
x − μ = −4, −1, −3, 1, 3, 0, 2, 1, 3, −2
(x − μ)² = 16, 1, 9, 1, 9, 0, 4, 1, 9, 4
Σ(x − μ)² = 16 + 1 + 9 + 1 + 9 + 0 + 4 + 1 + 9 + 4 = 54
σ² = Σ(x − μ)² / n
= 54 / 10
= 5.4
The standard deviation is the square root of the mean of all squared deviations from the mean, that is,
Standard deviation = √variance
For the example above, the standard deviation is √5.4 ≈ 2.32.
Standard Deviation: A rough measure of the average (or standard) amount by which scores deviate on either side of their mean.
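The worked variance example can be verified step by step. This sketch treats the ten scores as a whole population (dividing by n), which is what the formula above does; `pvariance` and `pstdev` from the standard library are used only as a cross-check.

```python
from math import sqrt
from statistics import pvariance, pstdev

x = [5, 8, 6, 10, 12, 9, 11, 10, 12, 7]
n = len(x)

mu = sum(x) / n                                    # mean: 9.0
squared_deviations = [(xi - mu) ** 2 for xi in x]  # (x - mu)^2 for each score
variance = sum(squared_deviations) / n             # 54 / 10 = 5.4
std_dev = sqrt(variance)                           # about 2.32

print(mu, variance, round(std_dev, 2))
```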

Standard Deviation: A Measure of Distance


The mean is a measure of position, but the standard deviation is a measure
of distance (on either side of the mean of the distribution).

Sum of Squares (SS)


Calculating the standard deviation requires that we obtain first a value for
the variance. However, calculating the variance requires, in turn, that we
obtain the sum of the squared deviation scores.
The sum of squared deviation scores, or more simply the sum of squares, is symbolized by SS:
\[ SS = \sum (X - \mu)^2 \]
“The sum of squares equals the sum of all squared deviation scores.” You can reconstruct this formula by remembering the following three steps:
1. Subtract the population mean, μ, from each original score, X, to obtain a deviation score, X − μ.
2. Square each deviation score, (X − μ)², to eliminate negative signs.
3. Sum all squared deviation scores, Σ (X − μ)².
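The three steps translate almost line for line into code. The function name `sum_of_squares` is introduced here for illustration, and the scores are the ones used in the variance example earlier in this handout.

```python
def sum_of_squares(scores):
    """Sum of squares SS for a population of scores, following the three steps."""
    mu = sum(scores) / len(scores)
    deviations = [x - mu for x in scores]   # step 1: X - mu
    squared = [d ** 2 for d in deviations]  # step 2: (X - mu)^2
    return sum(squared)                     # step 3: sum of all squared deviations


scores = [5, 8, 6, 10, 12, 9, 11, 10, 12, 7]
ss = sum_of_squares(scores)
print(ss)  # 54.0
```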

DEGREES OF FREEDOM (df)

 Degrees of freedom (df) refers to the number of values that are free to vary, given one or more mathematical restrictions, in a sample being used to estimate a population characteristic.
 Put differently, degrees of freedom are the number of independent values that can be estimated in a statistical analysis. These values are themselves free of constraint, although they do impose restrictions on the remaining values if the data set is to comply with the estimated parameters.

Formula
Degrees of freedom: df = n − 1

Example
Consider a data set that consists of five positive integers. The mean of the five integers must be 6, so their sum must be 5 × 6 = 30. Four of the values are freely chosen as 3, 8, 5, and 4.
The sum of these four values is 20, so the fifth integer is no longer free to vary: it must be 30 − 20 = 10. Four of the five values were free to vary, so df = n − 1 = 5 − 1 = 4.

The number of values free to vary, that is, the degrees of freedom, appears in the denominator of the formulas for s² and s. In fact, we can use degrees of freedom to rewrite the formulas for the sample variance and standard deviation:
\[ s^2 = \frac{SS}{n-1} = \frac{SS}{df}, \qquad s = \sqrt{\frac{SS}{n-1}} = \sqrt{\frac{SS}{df}} \]
Here SS is the sum of squared deviations from the sample mean.
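As a minimal sketch of the sample formulas s² = SS/df and s = √(SS/df) with df = n − 1, the code below reuses the ten scores from the variance example but now treats them as a sample, which is an assumption made only for illustration. The standard library's `variance` and `stdev` also divide by n − 1 and serve as a cross-check.

```python
from math import sqrt
from statistics import variance, stdev

# The handout's ten scores, treated here as a sample (an assumption).
sample = [5, 8, 6, 10, 12, 9, 11, 10, 12, 7]
n = len(sample)
sample_mean = sum(sample) / n

ss = sum((x - sample_mean) ** 2 for x in sample)  # sum of squares: 54.0
df = n - 1                                        # degrees of freedom: 9

sample_variance = ss / df                         # s^2 = 54 / 9 = 6.0
sample_std_dev = sqrt(sample_variance)            # s = sqrt(6), about 2.45

print(df, sample_variance, round(sample_std_dev, 2))
```

Dividing by df = 9 instead of n = 10 gives 6.0 rather than 5.4, so the sample estimate is slightly larger than the population formula would give.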

LECTURE HANDOUTS L16

CSE II/III
Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : II – Describing Data Date of Lecture:


Topic of Lecture: Interquartile Range
Introduction :
The interquartile range (IQR) is simply the range for the middle 50 percent of the scores. More specifically, the IQR equals the distance between the third quartile (or 75th percentile) and the first quartile (or 25th percentile), that is, after the highest quarter (or top 25 percent) and the lowest quarter (or bottom 25 percent) have been trimmed from the original set of scores.

Prerequisite knowledge for Complete understanding and learning of Topic:

 Basic Mathematics

 Statistics
Detailed content of the Lecture:

INTERQUARTILE RANGE (IQR)

The interquartile range (IQR) is simply the range for the middle 50 percent of the scores. More specifically, the IQR equals the distance between the third quartile (or 75th percentile) and the first quartile (or 25th percentile), that is, after the highest quarter (or top 25 percent) and the lowest quarter (or bottom 25 percent) have been trimmed from the original set of scores. Since most distributions are spread more widely in their extremities than their middle, the IQR tends to be less than half the size of the range.
Simply put, the IQR describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR), first find the median (middle value) of the lower half and of the upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
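The "median of each half" procedure can be sketched as below. The function name `interquartile_range` and the data are hypothetical. When n is odd, this sketch leaves the middle value out of both halves, which is one common convention; other quartile conventions can give slightly different answers.

```python
from statistics import median


def interquartile_range(values):
    """Return (Q1, Q3, IQR) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]           # values below the middle
    upper_half = ordered[half + n % 2:]   # values above it (middle skipped if n is odd)
    q1 = median(lower_half)
    q3 = median(upper_half)
    return q1, q3, q3 - q1


data = [1, 3, 4, 6, 7, 8, 9, 12]  # hypothetical scores
q1, q3, iqr = interquartile_range(data)
print(q1, q3, iqr)  # 3.5 8.5 5.0
```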
LECTURE HANDOUTS L17

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : II – Describing Data Date of Lecture:


Topic of Lecture: Normal Distributions

Introduction :

The normal curve is a theoretical curve defined for a continuous variable and noted for its symmetrical bell-shaped form. The normal curve peaks above a point midway along the horizontal spread and then tapers off gradually in either direction from the peak.
Prerequisite knowledge for Complete understanding and learning of Topic:

 Basic Mathematics
 Statistics

 Discrete Mathematics
Detailed content of the Lecture:

THE NORMAL CURVE


The normal distribution is a continuous probability distribution that is
symmetrical on both sides of the mean, so the right side of the center is a
mirror image of the left side.

Properties of the Normal Curve


 The normal curve is a theoretical curve defined for a continuous
variable, as described in Section 1.6, and noted for its symmetrical
bell-shaped form, as revealed in below figure
 Because the normal curve is symmetrical, its lower half is the mirror
image of its upper half.
 The normal curve peaks above a point midway along the horizontal
spread and then tapers off gradually in either direction from the peak
(without actually touching the horizontal axis, since, in theory, the
tails of a normal curve extend infinitely far).
 The values of the mean, median (or 50th percentile), and mode,
located at a point midway along the horizontal spread, are the same
for the normal curve.

Properties of a normal distribution


 The mean, mode and median are all equal.
 The curve is symmetric at the center (i.e. around the mean, μ).
 Exactly half of the values are to the left of center and exactly half the
values are to the right.

Different Normal Curves


As a theoretical exercise, it is instructive to note the various types of
normal curves that are produced by an arbitrary change in the value of
either the mean (μ) or the standard deviation (σ).
Obvious differences in appearance among normal curves are less
important than you might suspect. Because of their common mathematical
origin, every normal curve can be interpreted in exactly the same way once
any distance from the mean is expressed in standard deviation units.
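The claim that every normal curve reads the same way in standard deviation units can be checked numerically. The sketch below uses `statistics.NormalDist` (Python 3.8+); the three means and standard deviations are arbitrary values chosen for illustration.

```python
from statistics import NormalDist

# Three hypothetical normal curves with different means and standard deviations.
curves = [NormalDist(mu=0, sigma=1), NormalDist(mu=100, sigma=15), NormalDist(mu=66, sigma=4)]

within_one_sd = []
below_mean = []
for curve in curves:
    mu, sigma = curve.mean, curve.stdev
    # Proportion of area within one standard deviation on either side of the mean.
    within_one_sd.append(curve.cdf(mu + sigma) - curve.cdf(mu - sigma))
    # Proportion of area to the left of the mean (the curve is symmetrical).
    below_mean.append(curve.cdf(mu))

print([round(p, 4) for p in within_one_sd])  # the same proportion for every curve
print(below_mean)                            # half of the area lies below the mean
```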

LECTURE HANDOUTS L18

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : II – Describing Data Date of Lecture:


Topic of Lecture: Standard Z-scores

Introduction :

A z score can be defined as a measure of the number of standard deviations by which a score is below or above the mean of a distribution. In other words, it is used to determine the distance of a score from the mean. If the z score is positive, it indicates that the score is above the mean. If it is negative, then the score is below the mean. However, if the z score is 0, it denotes that the data point is the same as the mean.
Prerequisite knowledge for Complete understanding and learning of Topic:

 Probability
 Statistics

 Mathematics
Detailed content of the Lecture:

z SCORES

A z score is a unit-free, standardized score that, regardless of the original units of measurement, indicates how many standard deviations a score is above or below the mean of its distribution.
A z score can be defined as a measure of the number of standard deviations by which a score is below or above the mean of a distribution. In other words, it is used to determine the distance of a score from the mean. If the z score is positive, it indicates that the score is above the mean. If it is negative, then the score is below the mean. However, if the z score is 0, it denotes that the data point is the same as the mean.
To obtain a z score, express any original score, whether measured in inches, milliseconds, dollars, IQ points, etc., as a deviation from its mean (by subtracting its mean) and then split this deviation into standard deviation units (by dividing by its standard deviation):
\[ z = \frac{X - \mu}{\sigma} \]
where X is the original score and μ and σ are the mean and the standard deviation, respectively, for the normal distribution of the original scores. Since identical units of measurement appear in both the numerator and the denominator, the units cancel and the z score is unit-free.
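The conversion z = (X − μ)/σ described above can be sketched as a one-line function. The heights, mean of 69 inches, and standard deviation of 3 inches are hypothetical values used only to show the sign of z.

```python
def z_score(x, mu, sigma):
    """Express x as a number of standard deviations above or below the mean."""
    return (x - mu) / sigma


# Hypothetical distribution of heights: mean 69 inches, standard deviation 3 inches.
mu, sigma = 69, 3

z_above = z_score(72, mu, sigma)    #  1.0: one standard deviation above the mean
z_below = z_score(66, mu, sigma)    # -1.0: one standard deviation below the mean
z_at_mean = z_score(69, mu, sigma)  #  0.0: the score equals the mean

print(z_above, z_below, z_at_mean)
```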

STANDARD NORMAL CURVE


If the original distribution approximates a normal curve, then the shift to
standard or z scores will always produce a new distribution that
approximates the standard normal curve. This is the one normal curve for
which a table is actually available.

Although there is an infinite number of different normal curves, each with its own mean and standard deviation, there is only one standard normal curve, with a mean of 0 and a standard deviation of 1.

For a standard normal curve


Mean = 0

Standard deviation = 1

Standard Normal Table

The standard normal table consists of columns of z scores coordinated with columns of proportions.

Using the Top Legend of the Table


Notice that columns are arranged in sets of three, designated as A, B, and
C in the legend at the top of the table. When using the top legend, all
entries refer to the upper half of the standard normal curve. The entries in
column A are z scores, beginning with 0.00 and ending with 4.00

Using the Bottom Legend of the Table


Now the columns are designated as A′, B′, and C′ in the legend at the
bottom of the table. When using the bottom legend, all entries refer to the
lower half of the standard normal curve.
For a negative z score, columns B′ and C′ indicate how that z score splits the lower half of the normal curve. As suggested by the shading in the bottom legend of the table, column B′ indicates the proportion of area between the mean and the negative z score, and column C′ indicates the proportion of area beyond the negative z score, in the lower tail of the standard normal curve.
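The proportions that the table lists can also be computed. The sketch below uses `statistics.NormalDist` rather than the printed table; the function name `table_columns` is hypothetical, and the mapping to columns B/C (or B′/C′ for a negative z) follows the description above.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def table_columns(z):
    """Return (area between the mean and z, area beyond z) for a z score."""
    between_mean_and_z = abs(standard_normal.cdf(z) - 0.5)  # column B or B'
    beyond_z = 0.5 - between_mean_and_z                     # column C or C'
    return between_mean_and_z, beyond_z


b, c = table_columns(1.00)               # upper half of the curve
b_prime, c_prime = table_columns(-1.00)  # lower half of the curve

print(round(b, 4), round(c, 4))              # 0.3413 0.1587
print(round(b_prime, 4), round(c_prime, 4))  # 0.3413 0.1587
```

Because the curve is symmetrical, a negative z score gives the same two proportions as the positive z score of the same size.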

FINDING SCORES
So far, we have concentrated on normal curve problems for which Table A must be consulted to find the unknown proportion (of area) associated with some known score or pair of known scores.
Now we will concentrate on the opposite type of normal curve problem, for which Table A must be consulted to find the unknown score or scores associated with some known proportion.
This type of problem requires that we reverse our use of Table A by entering proportions in columns B, C, B′, or C′ and finding z scores listed in columns A or A′.
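Reversing the lookup, from a known proportion to an unknown score, can be sketched with `NormalDist.inv_cdf`, which stands in for reading the table backwards. The mean of 500, standard deviation of 100, and the 80 percent cut-off are hypothetical; the last step rearranges the z score formula to X = μ + zσ.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Hypothetical question: which score has 80 percent of all scores below it?
proportion_below = 0.80
z = standard_normal.inv_cdf(proportion_below)  # z score for that proportion

# Convert z back to an original score (hypothetical mean and standard deviation).
mu, sigma = 500, 100
x = mu + z * sigma

print(round(z, 2), round(x, 1))  # 0.84 584.2
```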
FINDING PROPORTIONS

Finding Proportions for One Score


 Sketch a normal curve and shade in the target area,
 Plan your solution according to the normal table.
 Convert X to z.

Finding Proportions between Two Scores


 Sketch a normal curve and shade in the target area,
(example, find proportion between 245 to 255)
 Plan your solution according to the normal table.
 Convert X to z by expressing 255 as
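The steps for two scores can be sketched as follows. The example in the text is cut off before its mean and standard deviation are given, so μ = 270 and σ = 15 below are assumptions made only for illustration; `NormalDist` replaces the table lookup.

```python
from statistics import NormalDist

# Assumed values (not given in the handout): mean 270, standard deviation 15.
mu, sigma = 270, 15
lower_score, upper_score = 245, 255

# Convert each X to z.
z_lower = (lower_score - mu) / sigma  # about -1.67
z_upper = (upper_score - mu) / sigma  # -1.0

# Proportion between the two scores = area below the upper z minus area below the lower z.
standard_normal = NormalDist()
proportion_between = standard_normal.cdf(z_upper) - standard_normal.cdf(z_lower)

print(round(z_lower, 2), z_upper, round(proportion_between, 4))
```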

LECTURE HANDOUTS L19

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : III – Describing Relationships Date of Lecture:

Topic of lecture: Correlation, Scatter Plots


Introduction :

Correlation
Correlation refers to a process for establishing the relationship between two variables. One way to get a general idea about whether or not two variables are related is to plot them on a “scatter plot”. While there are many measures of association for variables which are measured at the ordinal or higher level of measurement, correlation is the most commonly used approach.

Prerequisite knowledge for Complete understanding and learning of Topic:

 Probability
 Statistics

 Mathematics
Detailed content of the Lecture:

Types of Correlation
• Positive Correlation – when the values of the two variables move in
the same direction, so that an increase/decrease in the value of one
variable is accompanied by an increase/decrease in the value of the other
variable.
• Negative Correlation – when the values of the two variables move
in opposite directions, so that an increase/decrease in the value of
one variable is accompanied by a decrease/increase in the value of the
other variable.
• No Correlation – when there is no linear dependence or no relation
between the two variables.

Correlations are useful because they can indicate a predictive relationship
that can be exploited in practice. For example, an electrical utility may
produce less power on a mild day based on the correlation between
electricity demand and weather. In this example, there is a causal
relationship, because extreme weather causes people to use more
electricity for heating or cooling. However, in general, the presence of a
correlation is not sufficient to infer the presence of a causal relationship.

SCATTERPLOTS
A scatter plot is a graph containing a cluster of dots that represents all
pairs of scores. In other words, scatter plots are graphs that present the
relationship between two variables in a data set. Each pair of scores is
drawn as a point on a two-dimensional (Cartesian) plane.

Construction of scatter plots

• The independent variable or attribute is plotted on the X-axis.
• The dependent variable is plotted on the Y-axis.

Fig 6.1

Positive, Negative, or Little or No Relationship?

The first step is to note the tilt or slope, if any, of a dot cluster.
A dot cluster that has a slope from the lower left to the upper right, as in
panel A of the figure below, reflects a positive relationship.

A dot cluster that has a slope from the upper left to the lower right, as in
panel B of the figure below, reflects a negative relationship.

A dot cluster that lacks any apparent slope, as in panel C of the figure
below, reflects little or no relationship.

A dot cluster that equals (rather than merely approximates) a straight line
reflects a perfect relationship between two variables.

Curvilinear Relationship
The previous discussion assumes that a dot cluster approximates a straight
line and, therefore, reflects a linear relationship. But this is not always
the case. Sometimes a dot cluster approximates a bent or curved line, as in
the figure below, and therefore reflects a curvilinear relationship.

Video Content / Details of website for further learning:


 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
• Joel Grus, Data Science from Scratch, page nos.: 403
• Thomas Nield, Essential Math for Data Science, page nos.: 347
• Alex Gutman, Becoming a Data Head, page nos.: 272

Course Faculty

Verified by HoD
LECTURE HANDOUTS L20

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : III – Describing Relationships Date of Lecture:

Topic of lecture: Correlation Coefficient for Quantitative Data

Introduction :
The correlation coefficient, r, is a summary measure that describes the
extent of the statistical relationship between two interval or ratio level
variables.
Prerequisite knowledge for Complete understanding and learning of Topic:

 Statistics

 DBMS
Detailed content of the Lecture:

Properties of r
 The correlation coefficient is scaled so that it is always between -1 and +1.
 When r is close to 0 this means that there is little relationship
between the variables and the farther away from 0 r is, in either the
positive or negative direction, the greater the relationship between
the two variables.
 The sign of r indicates the type of linear relationship, whether positive or
negative.
 The numerical value of r, without regard to sign, indicates the strength of
the linear relationship.
 A number with a plus sign (or no sign) indicates a positive
relationship, and a number with a minus sign indicates a negative
relationship

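COMPUTATION FORMULA FOR r

Calculate a value for r by using the following computation formula:

$$ r = \frac{SP_{xy}}{\sqrt{SS_x \, SS_y}} $$

where the two sum of squares terms in the denominator are defined as

$$ SS_x = \sum x^2 - \frac{(\sum x)^2}{n}, \qquad SS_y = \sum y^2 - \frac{(\sum y)^2}{n} $$

The sum of the products term in the numerator, SPxy, is defined as

$$ SP_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} $$

Or the formula is written as

$$ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - (\sum x)^2\right]\left[n \sum y^2 - (\sum y)^2\right]}} $$

where
n = number of pairs of observations
Σx = total of the first variable's values
Σy = total of the second variable's values
Σxy = sum of the products of the first and second values
Σx² = sum of the squares of the first values
Σy² = sum of the squares of the second values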
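As a minimal sketch, the computation can be carried out in Python. It assumes the standard definitions SSx = Σx² − (Σx)²/n, SSy = Σy² − (Σy)²/n and SPxy = Σxy − (Σx)(Σy)/n, with r = SPxy / √(SSx · SSy). The function name and the five data pairs are chosen only for illustration.

```python
from math import sqrt

def correlation_r(xs, ys):
    """Pearson r from the sum-of-squares / sum-of-products computation formula."""
    n = len(xs)
    ss_x = sum(x * x for x in xs) - sum(xs) ** 2 / n                   # SSx
    ss_y = sum(y * y for y in ys) - sum(ys) ** 2 / n                   # SSy
    sp_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n  # SPxy
    return sp_xy / sqrt(ss_x * ss_y)

# Illustrative data: SSx = 10, SSy = 6, SPxy = 6
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation_r(xs, ys)
print(round(r, 3))  # 0.775
```

The positive sign indicates a positive relationship, and the magnitude (about 0.77) indicates its strength.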

The objective of quantitative research is to develop and employ mathematical
models, theories, and hypotheses pertaining to phenomena. The process
of measurement is central to quantitative research because it provides the
fundamental connection between empirical observation and mathematical
expression of quantitative relationships.
Quantitative data is any data that is in numerical form, such as statistics,
percentages, etc. The researcher analyses the data with the help of statistics and
hopes the numbers will yield an unbiased result that can be generalized to some
larger population. Qualitative research, on the other hand, inquires deeply into
specific experiences, with the intention of describing and exploring meaning through
text, narrative, or visual-based data, by developing themes exclusive to that set of
participants.
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
• Joel Grus, Data Science from Scratch, page nos.: 403
• Thomas Nield, Essential Math for Data Science, page nos.: 347
• Alex Gutman, Becoming a Data Head, page nos.: 272

Course Faculty

Verified by HoD
LECTURE HANDOUTS
L21

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : III – Describing Relationships Date of Lecture:

Topic of lecture: Computation formula for Correlation Coefficient

Introduction :

The correlation coefficient, r, is a summary measure that describes the


extent of the statistical relationship between two interval or ratio level
variables.

Prerequisite knowledge for Complete understanding and learning of Topic:

 Statistics

 DBMS
Detailed content of the Lecture:

Properties of r
 The correlation coefficient is scaled so that it is always between -1 and +1.
 When r is close to 0 this means that there is little relationship
between the variables and the farther away from 0 r is, in either the
positive or negative direction, the greater the relationship between
the two variables.
 The sign of r indicates the type of linear relationship, whether positive or
negative.
 The numerical value of r, without regard to sign, indicates the strength of
the linear relationship.
 A number with a plus sign (or no sign) indicates a positive
relationship, and a number with a minus sign indicates a negative
relationship

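COMPUTATION FORMULA FOR r

Calculate a value for r by using the following computation formula:

$$ r = \frac{SP_{xy}}{\sqrt{SS_x \, SS_y}} $$

where the two sum of squares terms in the denominator are defined as

$$ SS_x = \sum x^2 - \frac{(\sum x)^2}{n}, \qquad SS_y = \sum y^2 - \frac{(\sum y)^2}{n} $$

The sum of the products term in the numerator, SPxy, is defined as

$$ SP_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} $$

Or the formula is written as

$$ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - (\sum x)^2\right]\left[n \sum y^2 - (\sum y)^2\right]}} $$

where
n = number of pairs of observations
Σx = total of the first variable's values
Σy = total of the second variable's values
Σxy = sum of the products of the first and second values
Σx² = sum of the squares of the first values
Σy² = sum of the squares of the second values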
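The raw-sums form of the formula, r = (nΣxy − ΣxΣy) / √([nΣx² − (Σx)²][nΣy² − (Σy)²]), can be checked against the properties of r listed above. The sketch below uses hypothetical data in which y is an exact linear function of x, so r should come out as +1 or −1. The function name is an assumption made for illustration.

```python
from math import sqrt

def correlation_from_sums(xs, ys):
    """Pearson r computed directly from n, Σx, Σy, Σxy, Σx² and Σy²."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Perfect positive and perfect negative relationships (hypothetical data)
perfect_pos = correlation_from_sums([1, 2, 3, 4], [2, 4, 6, 8])
perfect_neg = correlation_from_sums([1, 2, 3, 4], [8, 6, 4, 2])
print(perfect_pos, perfect_neg)  # 1.0 -1.0
```

Values strictly between −1 and +1 arise whenever the dot cluster only approximates a straight line.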

The objective of quantitative research is to develop and employ mathematical
models, theories, and hypotheses pertaining to phenomena. The process
of measurement is central to quantitative research because it provides the
fundamental connection between empirical observation and mathematical expression
of quantitative relationships.
Quantitative data is any data that is in numerical form, such as statistics,
percentages, etc. The researcher analyses the data with the help of statistics and
hopes the numbers will yield an unbiased result that can be generalized to some
larger population. Qualitative research, on the other hand, inquires deeply into
specific experiences, with the intention of describing and exploring meaning through
text, narrative, or visual-based data, by developing themes exclusive to that set of
participants.
Views regarding the role of measurement in quantitative research are somewhat
divergent. Measurement is often regarded as being only a means by which
observations are expressed numerically in order to investigate causal relations or
associations. However, it has been argued that measurement often plays a more
important role in quantitative research. For example, Kuhn argued that within
quantitative research, the results that are shown can prove to be strange. This is
because accepting a theory based on results of quantitative data could prove to be a
natural phenomenon. He argued that such abnormalities are interesting when done
during the process of obtaining data.

Video Content / Details of website for further learning:


 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
• Joel Grus, Data Science from Scratch, page nos.: 403
• Thomas Nield, Essential Math for Data Science, page nos.: 347
• Alex Gutman, Becoming a Data Head, page nos.: 272

Course Faculty
Verified by HoD

LECTURE HANDOUTS L22

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : III – Describing Relationships Date of Lecture:

Topic of Lecture: Regression, Regression Line


Introduction :

A regression is a statistical technique that relates a dependent variable to one or
more independent (explanatory) variables. A regression model is able to show
whether changes observed in the dependent variable are associated with changes
in one or more of the explanatory variables.
Regression captures the correlation between variables observed in a data set, and
quantifies whether those correlations are statistically significant or not.
Prerequisite knowledge for Complete understanding and learning of Topic:

 Distribution concepts
 Probability

 Statistics
Detailed content of the Lecture:

A Regression Line
A regression line is the line that best describes the behavior of a set of data.
In other words, it is the line that best fits the trend of the given data.
The purpose of the line is to describe the relationship between a dependent
variable (Y variable) and one or more independent variables (X variables).
By using the equation obtained from the regression line, an analyst can
forecast future behaviour of the dependent variable by inputting different
values for the independent ones.
Types of regression
The two basic types of regression are
• Simple linear regression
Simple linear regression uses one independent variable to explain or predict the
outcome of the dependent variable Y.
• Multiple linear regression
Multiple linear regression uses two or more independent variables to explain or
predict the outcome of the dependent variable Y.
Linear regression has many practical uses. Most applications fall into one of the
following two broad categories:

 If the goal is error reduction in prediction or forecasting, linear regression can be


used to fit a predictive model to an observed data set of values of the response and
explanatory variables. After developing such a model, if additional values of the
explanatory variables are collected without an accompanying response value, the
fitted model can be used to make a prediction of the response.
 If the goal is to explain variation in the response variable that can be attributed
to variation in the explanatory variables, linear regression analysis can be applied
to quantify the strength of the relationship between the response and the
explanatory variables, and in particular to determine whether some explanatory
variables may have no linear relationship with the response at all, or to identify
which subsets of explanatory variables may contain redundant information about
the response.

Predictive Error
Prediction error refers to the difference between the predicted values
made by some model and the actual values.

Video Content / Details of website for further learning:


 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
• Joel Grus, Data Science from Scratch, page nos.: 403
• Thomas Nield, Essential Math for Data Science, page nos.: 347
• Alex Gutman, Becoming a Data Head, page nos.: 272

Course Faculty

Verified by HoD
LECTURE HANDOUTS L23

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : III – Describing Relationships Date of Lecture:

Topic of Lecture: Least Squares Regression Line

Introduction :

The placement of the regression line minimizes not the total predictive
error but the total squared predictive error, that is, the total for all squared
predictive errors. When located in this fashion, the regression line is often
referred to as the least squares regression line. The Least Squares Regression
Line is the line that minimizes the sum of the residuals squared. The residual is
the vertical distance between the observed point and the predicted point, and it
is calculated by subtracting ŷ from y.
Prerequisite knowledge for Complete understanding and learning of Topic:

 Probability
 Discrete Mathematics

 Statistics
Detailed content of the Lecture:

The Least Squares Regression Line is the line that minimizes the sum
of the residuals squared. The residual is the vertical distance between the
observed point and the predicted point, and it is calculated by subtracting
ŷ from y.

Formula
y’ = bx + a, where b is the slope and a is the y-intercept.

$$ b = \frac{N \sum xy - \sum x \sum y}{N \sum x^2 - (\sum x)^2} $$

$$ a = \frac{\sum y - b \sum x}{N} $$

In statistics, ordinary least squares (OLS) is a type of linear least
squares method for choosing the unknown parameters in a linear regression model
(with fixed level-one effects of a linear function of a set of explanatory variables) by
the principle of least squares: minimizing the sum of the squares of the differences
between the observed dependent variable (values of the variable being observed) in
the input dataset and the output of the (linear) function of the independent variable.
Geometrically, this is seen as the sum of the squared distances, parallel to the
axis of the dependent variable, between each data point in the set and the
corresponding point on the regression surface—the smaller the differences, the
better the model fits the data. The resulting estimator can be expressed by a simple
formula, especially in the case of a simple linear regression, in which there is a
single regressor on the right side of the regression equation.
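Example

x     y
2     4
3     5
5     7
7     10
9     15

Step 1: For each (x, y) calculate x² and xy:

x     y     x²    xy
2     4     4     8
3     5     9     15
5     7     25    35
7     10    49    70
9     15    81    135

Step 2: Sum the columns: Σx = 26, Σy = 41, Σx² = 168, Σxy = 263, with N = 5.

Step 3: Calculate the slope:
b = (5 × 263 − 26 × 41) / (5 × 168 − 26²) = 249 / 164 ≈ 1.5183, rounded to 1.518

Step 4: Calculate the intercept:
a = (41 − 1.5183 × 26) / 5 ≈ 0.305

Step 5: The regression line is y’ = 1.518x + 0.305. The predicted values and
the errors (predicted value minus observed value) are:

x     y     y’ = 1.518x + 0.305    error
2     4     3.34                   −0.66
3     5     4.86                   −0.14
5     7     7.89                   0.89
7     10    10.93                  0.93
9     15    13.97                  −1.03

To predict the y value we can assume any value for x. Assume x = 8.
Then y’ = 1.518 × 8 + 0.305
= 12.45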
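The same fit can be reproduced with a short Python sketch, using the slope and intercept formulas b = (NΣxy − ΣxΣy) / (NΣx² − (Σx)²) and a = (Σy − bΣx) / N together with the five data pairs of this example. The function name is chosen here only for illustration.

```python
def least_squares_line(xs, ys):
    """Return (b, a): slope and intercept of the least squares regression line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = (sum_y - b * sum_x) / n                                    # intercept
    return b, a

xs = [2, 3, 5, 7, 9]
ys = [4, 5, 7, 10, 15]
b, a = least_squares_line(xs, ys)
print(round(b, 3), round(a, 3))  # 1.518 0.305

prediction = b * 8 + a           # predicted y for x = 8
print(round(prediction, 2))      # 12.45
```

Because the code keeps full precision for b and a, individual predictions can differ in the last decimal place from values computed with the rounded coefficients.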
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]

Important Books/Journals for further learning including the page nos.:

• Joel Grus, Data Science from Scratch, page nos.: 403
• Thomas Nield, Essential Math for Data Science, page nos.: 347
• Alex Gutman, Becoming a Data Head, page nos.: 272

Course Faculty

Verified by HoD

LECTURE HANDOUTS L24

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : III – Describing Relationships Date of Lecture:


Topic of Lecture: Standard Error of Estimate

Introduction :

The standard error of the estimate is a measure of the accuracy of predictions.
The regression line is the line that minimizes the sum of squared deviations of
prediction (also called the sum of squares error), and the standard error of the
estimate is the square root of the average squared deviation.
Prerequisite knowledge for Complete understanding and learning of Topic:

 Probability
 Discrete Mathematics

 Statistics
Detailed content of the Lecture:
The standard error of estimate is symbolized as s y|x. This estimate of predictive
error complies with the general format for any sample standard deviation, that is, the
square root of a sum of squares term divided by its degrees of freedom:

$$ s_{y|x} = \sqrt{\frac{\sum (y - y')^2}{n - 2}} $$

Example
Calculate the standard error of estimate for the given X and Y values:
X = 1, 2, 3, 4, 5 and Y = 2, 4, 5, 4, 5.

Solution
Create columns labeled x, y, x², xy, y’, y – y’ and (y – y’)², with N = 5.
The xy and x² columns are needed to find the value of b.

x       y       x²       xy       y’ = bx + a   y – y’   (y – y’)²
1       2       1        2        2.8           -0.8     0.64
2       4       4        8        3.4           0.6      0.36
3       5       9        15       4.0           1.0      1.00
4       4       16       16       4.6           -0.6     0.36
5       5       25       25       5.2           -0.2     0.04
Σx:15   Σy:20   Σx²:55   Σxy:66                          Σ(y – y’)² = 2.4

b = (N Σxy − Σx Σy) / (N Σx² − (Σx)²)
= (5 x 66 – 15 x 20) / (5 x 55 – 15²)
= 30/50 = 0.6

a = (Σy − b Σx) / N
= (20 – (0.6 x 15)) / 5
= (20 – 9) / 5
= 11/5 = 2.2

s y|x = √(Σ(y – y’)² / (n – 2))

= √(2.4/3)

s y|x = 0.894
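The worked example can be verified with a short Python sketch. It fits the least squares line first and then applies the standard error of estimate as the square root of Σ(y − y’)² divided by n − 2. The function name is an assumption made for illustration.

```python
from math import sqrt

def standard_error_of_estimate(xs, ys):
    """Square root of the sum of squared prediction errors over (n - 2)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = (sum_y - b * sum_x) / n                                    # intercept
    ss_error = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
    return sqrt(ss_error / (n - 2))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
s_yx = standard_error_of_estimate(xs, ys)
print(round(s_yx, 3))  # 0.894
```

Two degrees of freedom are lost because both the slope and the intercept are estimated from the same data.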
Video Content / Details of website for further learning:

 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:

• Joel Grus, Data Science from Scratch, page nos.: 403
• Thomas Nield, Essential Math for Data Science, page nos.: 347
• Alex Gutman, Becoming a Data Head, page nos.: 272

Course Faculty

Verified by HoD
LECTURE HANDOUTS L25

CSE II/III
Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : III – Describing Relationships Date of Lecture:

Topic of Lecture: Interpretation of R2


Introduction :

R-squared (R², or the coefficient of determination) is a statistical
measure in a regression model that gives the proportion of variance in
the dependent variable that can be explained by the independent variable.
In other words, r-squared shows how well the data fit the regression model
(the goodness of fit).
R-squared can take any value between 0 and 1. Although the
statistical measure provides some useful insights regarding the regression
model, the user should not rely only on the measure in the assessment of a
statistical model.

Prerequisite knowledge for Complete understanding and learning of Topic:


 Probability
 Discrete Mathematics

 Statistics
Detailed content of the Lecture:

In addition, it does not indicate the correctness of the regression


model. Therefore, the user should always draw conclusions about the model
by analyzing r-squared together with the other variables in a statistical
model.
The most common interpretation of r-squared is how well the regression
model explains observed data.
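As an illustration, r-squared can be computed for a simple linear regression. The sketch below assumes the usual definition R² = 1 − SS(error) / SS(total), which is not spelled out in this handout, and uses five hypothetical data pairs.

```python
def r_squared(xs, ys):
    """Proportion of the variance in y explained by the least squares line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = (sum_y - b * sum_x) / n                                    # intercept
    mean_y = sum_y / n
    ss_error = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))  # unexplained
    ss_total = sum((y - mean_y) ** 2 for y in ys)                   # total
    return 1 - ss_error / ss_total

# Hypothetical data: SS(error) = 2.4 and SS(total) = 6
r2 = r_squared([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(r2, 2))  # 0.6
```

Here 60% of the variance in y is explained by x. For a regression with a single predictor, this value equals the square of the correlation coefficient r.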
MULTIPLE REGRESSION EQUATIONS
Multiple regression is a statistical technique applied on datasets dedicated to
draw out a relationship between one response or dependent variable and
multiple independent variables.
Multiple regression works by considering the values of the available multiple
independent variables and predicting the value of one dependent variable.

Example:
A researcher decides to study students’ performance at a school over a
period of time. He observes that as lectures moved online, the performance
of students started to decline. The dependent variable “decrease in
performance” is explained by several independent variables, such as “lack
of attention”, “more internet addiction”, and “neglecting studies”.

Multiple regression equation:
y = b1x1 + b2x2 + … + bnxn + a

REGRESSION TOWARD THE MEAN


Regression toward the mean refers to a tendency for scores, particularly
extreme scores, to shrink toward the mean.
In statistics, regression toward the mean (also called reversion to the mean,
and reversion to mediocrity) is a concept that refers to the fact that if one
sample of a random variable is extreme, the next sampling of the same
random variable is likely to be closer to its mean.

Example
A military commander has two units return, one with 20% casualties and
another with 50% casualties. He praises the first and berates the second.
The next time, the two units return with the opposite results. From this
experience, he “learns” that praise weakens performance and berating
increases performance.

The Regression Fallacy


The regression fallacy is committed whenever regression toward the
mean is interpreted as a real, rather than a chance, effect.
The regression fallacy can be avoided by splitting the subset of extreme
observations into two groups.
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:

• Joel Grus, Data Science from Scratch, page nos.: 403
• Thomas Nield, Essential Math for Data Science, page nos.: 347
• Alex Gutman, Becoming a Data Head, page nos.: 272

Course Faculty

Verified by HoD

L26
LECTURE HANDOUTS

CSE II/III
Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : III – Describing Relationships Date of Lecture:


Topic of Lecture: Multiple regression Equation
Introduction :

Multiple regression is a statistical technique applied on datasets


dedicated to draw out a relationship between one response or dependent
variable and multiple independent variables.
Multiple regression works by considering the values of the available multiple
independent variables and predicting the value of one dependent variable
Prerequisite knowledge for Complete understanding and learning of Topic:

 Probability
 Discrete Mathematics
 Statistics
Detailed content of the Lecture:

MULTIPLE REGRESSION EQUATIONS


Multiple regression is a statistical technique applied on datasets dedicated to
draw out a relationship between one response or dependent variable and
multiple independent variables.
Multiple regression works by considering the values of the available multiple
independent variables and predicting the value of one dependent variable.

Example:
A researcher decides to study students’ performance at a school over a
period of time. He observes that as lectures moved online, the performance
of students started to decline. The dependent variable “decrease in
performance” is explained by several independent variables, such as “lack
of attention”, “more internet addiction”, and “neglecting studies”.

Multiple regression equation:
y = b1x1 + b2x2 + … + bnxn + a
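A minimal sketch of estimating the coefficients of the equation y = b1x1 + b2x2 + a is given below. It uses NumPy's `np.linalg.lstsq` as one possible least squares solver. The two predictors and the response are hypothetical, generated exactly from b1 = 2, b2 = 3 and a = 1 so that the recovered coefficients can be checked.

```python
import numpy as np

# Hypothetical data: two independent variables and one dependent variable
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
y = 2 * x1 + 3 * x2 + 1

# Design matrix: one column per predictor plus a column of ones for the intercept a
A = np.column_stack([x1, x2, np.ones_like(x1)])

# Least squares solution of A @ [b1, b2, a] = y
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
b1, b2, a = coeffs
print(np.round(coeffs, 3))

# Predict y for a new observation with x1 = 6 and x2 = 5
prediction = b1 * 6 + b2 * 5 + a
```

With real data the fit will not be exact, and the coefficients are the values that minimize the total squared predictive error.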

Regression methods continue to be an area of active research. In recent


decades, new methods have been developed for robust regression, regression
involving correlated responses such as time series and growth curves, regression in
which the predictor (independent variable) or response variables are curves, images,
graphs, or other complex data objects, regression methods accommodating various
types of missing data, nonparametric regression, Bayesian methods for regression,
regression in which the predictor variables are measured with error, regression with
more predictor variables than observations, and causal inference with regression.

Once a regression model has been constructed, it may be important to confirm


the goodness of fit of the model and the statistical significance of the estimated
parameters. Commonly used checks of goodness of fit include the R-squared,
analyses of the pattern of residuals and hypothesis testing. Statistical significance
can be checked by an F-test of the overall fit, followed by t-tests of individual
parameters.
Interpretations of these diagnostic tests rest heavily on the model's
assumptions. Although examination of the residuals can be used to invalidate a
model, the results of a t-test or F-test are sometimes more difficult to interpret if the
model's assumptions are violated. For example, if the error term does not have a
normal distribution, in small samples the estimated parameters will not follow normal
distributions and complicate inference. With relatively large samples, however,
a central limit theorem can be invoked such that hypothesis testing may proceed
using asymptotic approximations.
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
• Joel Grus, Data Science from Scratch, page nos.: 403
• Thomas Nield, Essential Math for Data Science, page nos.: 347
• Alex Gutman, Becoming a Data Head, page nos.: 272

Course Faculty

Verified by HoD

LECTURE HANDOUTS L27

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : III – Describing Relationships Date of Lecture:


Topic of Lecture: Regression towards the Mean
Introduction :
Regression toward the mean refers to a tendency for scores, particularly
extreme scores, to shrink toward the mean.
In statistics, regression toward the mean (also called reversion to the mean, and
reversion to mediocrity) is a concept that refers to the fact that if one sample of
a random variable is extreme, the next sampling of the same random variable is
likely to be closer to its mean.
Prerequisite knowledge for Complete understanding and learning of Topic:
 Probability
 Discrete Mathematics

 Statistics
Detailed content of the Lecture:

MULTIPLE REGRESSION EQUATIONS


Multiple regression is a statistical technique applied on datasets dedicated to
draw out a relationship between one response or dependent variable and multiple
independent variables.
Multiple regression works by considering the values of the available multiple
independent variables and predicting the value of one dependent variable.
Example:
A researcher decides to study students’ performance at a school over a
period of time. He observes that as lectures moved online, the performance
of students started to decline. The dependent variable “decrease in
performance” is explained by several independent variables, such as “lack
of attention”, “more internet addiction”, and “neglecting studies”.
Multiple regression equation:
y = b1x1 + b2x2 + … + bnxn + a
Example
A military commander has two units return, one with 20% casualties and
another with 50% casualties. He praises the first and berates the second. The
next time, the two units return with the opposite results. From this experience,
he “learns” that praise weakens performance and berating increases
performance.

The Regression Fallacy


The regression fallacy is committed whenever regression toward the
mean is interpreted as a real, rather than a chance, effect.
The regression fallacy can be avoided by splitting the subset of extreme
observations into two groups.
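Regression toward the mean can be illustrated with standard scores. The sketch below assumes the standardized form of the least squares prediction, predicted z = r × z, which is a standard result not stated in this handout. The function name is chosen for illustration.

```python
def predicted_z(z_x, r):
    """Predicted standard score on Y for a standard score z_x on X."""
    return r * z_x

# An extreme score two standard deviations above the mean
for r in (1.0, 0.5, 0.0):
    print(r, predicted_z(2.0, r))
# prints:
# 1.0 2.0
# 0.5 1.0
# 0.0 0.0
```

Unless the correlation is perfect, the predicted score lies closer to the mean than the original extreme score. This shrinkage occurs by chance alone, with no praise or blame involved.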
Regression methods continue to be an area of active research. In recent
decades, new methods have been developed for robust regression, regression
involving correlated responses such as time series and growth curves, regression in
which the predictor (independent variable) or response variables are curves, images,
graphs, or other complex data objects, regression methods accommodating various
types of missing data, nonparametric regression, Bayesian methods for regression,
regression in which the predictor variables are measured with error, regression with
more predictor variables than observations, and causal inference with regression.
Once a regression model has been constructed, it may be important to confirm
the goodness of fit of the model and the statistical significance of the estimated
parameters. Commonly used checks of goodness of fit include the R-squared,
analyses of the pattern of residuals and hypothesis testing. Statistical significance
can be checked by an F-test of the overall fit, followed by t-tests of individual
parameters.
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]

Important Books/Journals for further learning including the page nos.:

• Joel Grus, Data Science from Scratch, page nos.: 403
• Thomas Nield, Essential Math for Data Science, page nos.: 347
• Alex Gutman, Becoming a Data Head, page nos.: 272

Course Faculty

Verified by HoD

LECTURE HANDOUTS L28

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]


Unit : IV – Python Libraries for Data Wrangling Date of Lecture:
Topic of Lecture: Basics of Numpy Arrays
Introduction :

NumPy (short for Numerical Python) provides an efficient interface to store
and operate on dense data buffers. NumPy arrays are like Python’s built-in
list type, but NumPy arrays provide much more efficient storage and data
operations as the arrays grow larger in size.
Prerequisite knowledge for Complete understanding and learning of Topic:

 Python

 Mathematics
Detailed content of the Lecture:

Attributes of arrays
Determining the size, shape, memory consumption, and data types of
arrays

Indexing of arrays
Getting and setting the value of individual array elements

Slicing of arrays
Getting and setting smaller sub arrays within a larger array

Reshaping of arrays
Changing the shape of a given array

Joining and splitting of arrays


Combining multiple arrays into one, and splitting one array into many

NumPy Array Attributes


 ndim (the number of dimensions),
 shape (the size of each dimension)
 size (the total size of the array)

Example
import numpy as np

np.random.seed(0) # seed for reproducibility

x1 = np.random.randint(10, size=6) # One-dimensional array

x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array

x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array

print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print("dtype:", x3.dtype)
print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")

Array Indexing: Accessing Single Elements

• Indexing in NumPy will feel quite familiar if you know list indexing.
• In a one-dimensional array, you can access the ith value (counting from
zero) by specifying the desired index in square brackets, just as with Python
lists.
• To index from the end of the array, you can use negative indices.
• In a multidimensional array, you access items using a comma-separated
tuple of indices.
• Unlike Python lists, NumPy arrays have a fixed type. This means, for
example, that if you attempt to insert a floating-point value into an
integer array, the value will be silently truncated.
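The indexing rules above can be sketched as follows. The array values are chosen only for illustration.

```python
import numpy as np

x1 = np.array([5, 0, 3, 3, 7, 9])
x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

first = x1[0]        # 5: index counting from zero
last = x1[-1]        # 9: negative index counts from the end
corner = x2[2, -1]   # 7: row 2, last column, using a tuple of indices

x2[0, 0] = 12        # the same notation sets a value
x1[0] = 3.14159      # x1 holds integers, so the value is truncated to 3
print(x1)            # [3 0 3 3 7 9]
```

The truncation happens without any warning, so it is worth checking an array's dtype before assigning floating-point values into it.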

Array Slicing: Accessing Subarrays

Just as we can use square brackets to access individual array elements, we
can also use them to access subarrays with the slice notation, marked by the
colon (:) character.
The NumPy slicing syntax follows that of the standard
Python list; to access a slice of an array x, use this:
x[start:stop:step]
start – index at which the slice begins
stop – index at which the slice ends (the element at this index is not included)
step – spacing between the indices of consecutive elements in the slice
If any of these are unspecified, they default to the values start=0, stop=size of
dimension, step=1.
Example
x = np.arange(10)
x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

x[:5] # first five elements

array([0, 1, 2, 3, 4])

x[5:] # elements from index 5 onward

array([5, 6, 7, 8, 9])
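The step value is easiest to understand from a few more slices of the same kind of array. This is a small illustrative sketch, and the variable names are chosen here.

```python
import numpy as np

x = np.arange(10)          # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

middle = x[4:7]            # [4 5 6]: the stop index 7 is excluded
every_other = x[::2]       # [0 2 4 6 8]: step of 2 from the start
from_one = x[1::2]         # [1 3 5 7 9]: step of 2 starting at index 1
reversed_x = x[::-1]       # [9 8 7 6 5 4 3 2 1 0]: negative step reverses
print(middle, every_other, from_one, reversed_x)
```

When the step is negative, the defaults for start and stop are swapped, which is why x[::-1] is a convenient way to reverse an array.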

Multidimensional sub arrays

Multidimensional slices work in the same way, with multiple
slices separated by commas. For example:
x2
array([[12, 5, 2, 4],
       [ 7, 6, 8, 8],
       [ 1, 6, 7, 7]])

x2[:2, :3] # two rows, three columns
array([[12, 5, 2],
       [ 7, 6, 8]])

x2[:3, ::2] # all rows, every other column (every second column)
array([[12, 2],
       [ 7, 8],
       [ 1, 7]])

Finally, sub array dimensions can even be reversed together
x2[::-1, ::-1]
array([[ 7, 7, 6, 1],
       [ 8, 8, 6, 7],
       [ 4, 2, 5, 12]])

Reshaping of Arrays
The most flexible way of changing the shape of an array is with the reshape()
method. For example, if you want to put the numbers 1 through 9 in a 3×3 grid,
you can do the following
grid = np.arange(1, 10).reshape((3, 3))
print(grid)
[[1 2 3]
 [4 5 6]
 [7 8 9]]
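A few further reshaping patterns are sketched below as an illustration. The size of the initial array must match the size of the reshaped array. The use of `np.newaxis` and of `-1` for an inferred dimension are standard NumPy features not covered in the text above.

```python
import numpy as np

x = np.array([1, 2, 3])

row = x.reshape((1, 3))        # row vector, shape (1, 3)
col = x.reshape((3, 1))        # column vector, shape (3, 1)
row_alt = x[np.newaxis, :]     # same row vector via newaxis
col_alt = x[:, np.newaxis]     # same column vector via newaxis

# -1 asks NumPy to infer that dimension from the total size (12 / 3 = 4)
grid = np.arange(12).reshape((3, -1))
print(row.shape, col.shape, grid.shape)  # (1, 3) (3, 1) (3, 4)
```

Where possible, reshape returns a view of the original data rather than a copy, so modifying the reshaped array can modify the original.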

Array Concatenation and Splitting

Concatenation of arrays
Concatenation, or joining of two arrays in NumPy, is primarily accomplished
through the routines np.concatenate, np.vstack, and np.hstack.
np.concatenate takes a tuple or list of arrays as its first argument.
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

array([1, 2, 3, 3, 2, 1])

You can also concatenate more than two arrays at once.

z = [99, 99, 99]
print(np.concatenate([x, y, z]))
[ 1 2 3 3 2 1 99 99 99]
np.concatenate can also be used for two-dimensional arrays.
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
np.concatenate([grid, grid]) # concatenate along the first axis

array([[1, 2, 3],
       [4, 5, 6],
       [1, 2, 3],
       [4, 5, 6]])
Concatenate along the second axis (zero-indexed):
np.concatenate([grid, grid], axis=1)

array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])

np.vstack (vertical stack) function

x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])
np.vstack([x, grid])

array([[1, 2, 3],
       [9, 8, 7],
       [6, 5, 4]])

np.hstack (horizontal stack) function

y = np.array([[99],
              [99]])
np.hstack([grid, y])

array([[ 9, 8, 7, 99],
       [ 6, 5, 4, 99]])
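The three joining routines differ mainly in the axis along which they join. The sketch below reuses the small arrays from this section and prints only the resulting shapes; the result names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])

joined = np.concatenate([x, y])                       # 1-D join
side_by_side = np.concatenate([grid, grid], axis=1)   # join along columns
stacked = np.vstack([x, grid])                        # x becomes a new first row
widened = np.hstack([grid, np.array([[99], [99]])])   # one extra column

print(joined.shape, side_by_side.shape, stacked.shape, widened.shape)
```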

Splitting of arrays
The opposite of concatenation is splitting, which is implemented by the functions
np.split, np.hsplit, and np.vsplit. For each of these, we can pass a list of
indices giving the split points.
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)

[1 2 3] [99 99] [3 2 1]
Notice that N split points lead to N + 1 subarrays. The related functions np.hsplit
and np.vsplit are similar.
grid = np.arange(16).reshape((4, 4))
grid
array([[ 0, 1, 2, 3],
       [ 4, 5, 6, 7],
       [ 8, 9, 10, 11],
       [12, 13, 14, 15]])

upper, lower = np.vsplit(grid, [2])
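As a sketch of how the vertical and horizontal splits behave on the 4×4 grid, the following splits it once along each axis. The names left and right are illustrative additions.

```python
import numpy as np

grid = np.arange(16).reshape((4, 4))

upper, lower = np.vsplit(grid, [2])   # rows 0-1 and rows 2-3
left, right = np.hsplit(grid, [2])    # columns 0-1 and columns 2-3

print(upper)
print(left)
```

In both calls the single split point 2 produces two subarrays, in line with the "N split points give N + 1 subarrays" rule.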


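\[
\begin{aligned}
11^4 \bmod 13 &= \big((11^2 \bmod 13) \times (11^2 \bmod 13)\big) \bmod 13 \\
&= (4 \times 4) \bmod 13 \\
&= 16 \bmod 13 \\
&= 3 \\
11^7 \bmod 13 &= \big((11^4 \bmod 13) \times (11^2 \bmod 13) \times (11^1 \bmod 13)\big) \bmod 13 \\
&= (3 \times 4 \times 11) \bmod 13 \\
&= 132 \bmod 13 \\
&= 2
\end{aligned}
\]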
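The arithmetic above reduces after every multiplication instead of computing the full power first. A minimal sketch of that idea in Python is given below; mod_pow is a hypothetical helper written for illustration, and Python's built-in pow(base, exponent, modulus) gives the same result.

```python
def mod_pow(base, exponent, modulus):
    """Square-and-multiply modular exponentiation (illustrative helper)."""
    result = 1
    base %= modulus
    while exponent > 0:
        if exponent & 1:                  # this power of two is part of the exponent
            result = (result * base) % modulus
        base = (base * base) % modulus    # move on to the next squared power
        exponent >>= 1
    return result

print(mod_pow(11, 4, 13), mod_pow(11, 7, 13))
```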
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
• Data Science from Scratch, author: Joel Grus, page nos.: 403
• Essential Math for Data Science, author: Thomas Nield, page nos.: 347
• Becoming a Data Head, author: Alex Gutman, page nos.: 272

Course Faculty

Verified by HoD

LECTURE HANDOUTS L29

CSE II/III
Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : IV – Python Libraries for Data Wrangling Date of Lecture:


Topic of Lecture: Aggregations, Computation on NumPy Arrays: Universal Functions

Introducing UFuncs
For many types of operations, NumPy provides a convenient interface into
statically typed, compiled routines. Applying such a routine to a whole array
at once is known as a vectorized operation.

Prerequisite knowledge for Complete understanding and learning of Topic:

• Python
• Mathematics
• Database
Detailed content of the Lecture:

Vectorized operations in NumPy are implemented via ufuncs, whose main
purpose is to quickly execute repeated operations on values in NumPy
arrays. Ufuncs are extremely flexible: they can operate between a scalar and
an array, and they can also operate between two arrays.
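To make the idea concrete, the sketch below computes reciprocals twice: once with an explicit Python loop and once with a vectorized expression. The function reciprocals_loop and the sample values are illustrative assumptions, not part of the handout.

```python
import numpy as np

def reciprocals_loop(values):
    """Plain Python loop, one element at a time (hypothetical helper)."""
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output

values = np.array([1, 2, 4, 5, 10])
looped = reciprocals_loop(values)
vectorized = 1.0 / values                 # operation between a scalar and an array
ratio = np.arange(5) / np.arange(1, 6)    # operation between two arrays
```

Both approaches give the same numbers; the vectorized form pushes the loop into compiled code, which is what makes it fast on large arrays.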

Exploring NumPy’s UFuncs


Ufuncs exist in two flavors: unary ufuncs, which operate on a single input,
and binary ufuncs, which operate on two inputs. We’ll see examples of both
these types of functions here.

Array arithmetic
NumPy’s ufuncs make use of Python’s native arithmetic operators. The
standard addition, subtraction, multiplication, and division can all be used.
x = np.arange(4)
print("x =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)

Operator   Equivalent ufunc   Description
+          np.add             Addition (e.g., 1 + 1 = 2)
-          np.subtract        Subtraction (e.g., 3 - 2 = 1)
-          np.negative        Unary negation (e.g., -2)
*          np.multiply        Multiplication (e.g., 2 * 3 = 6)
/          np.divide          Division (e.g., 3 / 2 = 1.5)
//         np.floor_divide    Floor division (e.g., 3 // 2 = 1)
**         np.power           Exponentiation (e.g., 2 ** 3 = 8)
%          np.mod             Modulus/remainder (e.g., 9 % 4 = 1)
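Each operator in the table is a thin wrapper around the named ufunc, so the two spellings can be compared directly. The following is a small sketch with an illustrative combined expression.

```python
import numpy as np

x = np.arange(4)

same_add = np.array_equal(x + 5, np.add(x, 5))
same_floor = np.array_equal(x // 2, np.floor_divide(x, 2))
same_power = np.array_equal(x ** 2, np.power(x, 2))
same_mod = np.array_equal(x % 2, np.mod(x, 2))

# Operators can be chained; the usual order of operations applies.
combined = -(0.5 * x + 1) ** 2
print(combined)
```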

Absolute value
Just as NumPy understands Python’s built-in arithmetic operators, it also
understands Python’s built-in absolute value function.
• np.absolute()
• np.abs()
x = np.array([-2, -1, 0, 1, 2])
abs(x)

The corresponding NumPy ufunc is np.absolute, which is also available under the
alias np.abs.

np.abs(x)
array([2, 1, 0, 1, 2])

Trigonometric functions
NumPy provides a large number of useful ufuncs, and some of the most useful
for the data scientist are the trigonometric functions.
• np.sin()
• np.cos()
• np.tan()
Inverse trigonometric functions
• np.arcsin()
• np.arccos()
• np.arctan()

Aggregations: Min, Max, and Everything in Between


Minimum and Maximum

Python has built-in min and max functions, used to find the minimum
value and maximum value of any given array.

For min, max, sum, and several other NumPy aggregates, a shorter syntax is
to use methods of the array object itself.

• np.min() – finds the minimum (smallest) value in the array
• np.max() – finds the maximum (largest) value in the array

Multidimensional aggregates
One common type of aggregation operation is an aggregate along a row or
column.
By default, each NumPy aggregation function will return the aggregate over the
entire array. For example, M.sum() calculates the sum of all elements of
the array M.
Example
M = np.random.random((3, 4))
print(M)

[[ 0.8967576   0.03783739  0.75952519  0.06682827]
 [ 0.8354065   0.99196818  0.19544769  0.43447084]
 [ 0.66859307  0.15038721  0.37911423  0.6687194 ]]
Aggregation functions take an additional argument specifying the axis along
which the aggregate is computed. For a two-dimensional array the axis is
normally 0 or 1. The axis names the dimension that is collapsed: with axis=0
the aggregate runs down the rows and returns one value per column, and with
axis=1 it runs across the columns and returns one value per row.
Example
We can find the minimum value within each column by specifying axis=0.
M.min(axis=0)
array([ 0.66859307, 0.03783739, 0.19544769, 0.06682827])
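Because random values are hard to check by eye, the same aggregates can be tried on a small fixed array. This is a sketch with made-up values; the array M here is not the random one printed above.

```python
import numpy as np

M = np.array([[1, 8, 3, 4],
              [5, 2, 7, 6],
              [9, 0, 2, 5]])   # small illustrative array

total = M.sum()                # aggregate over the entire array
column_min = M.min(axis=0)     # one value per column
row_max = M.max(axis=1)        # one value per row

print(total, column_min, row_max)
```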

Computation on Arrays: Broadcasting


Broadcasting is simply a set of rules for applying binary ufuncs
(addition, subtraction, multiplication, etc.) on arrays of different sizes.
For arrays of the same size, binary operations are performed on an
element-by-element basis.
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a + b
array([5, 6, 7])
Broadcasting allows these types of binary operations to be
performed on arrays of different sizes. For example, we can just as easily add
a scalar to an array:
a + 5
array([5, 6, 7])
Exponentiation

• Exponentiation is an operation on two elements, in which one element is
the base and the other is the exponent.
• For example, x^y is an exponential operation where x is the base and y is
the exponent.
• When y is a positive integer, exponentiation is repeated multiplication:
x^y is the product of y copies of x.
• Modular exponentiation is a type of exponentiation in which a modulo division
operation is performed after the exponentiation operation.

Rules of Broadcasting
Broadcasting in NumPy follows a strict set of rules to determine the interaction
between the two arrays.
• Rule 1: If the two arrays differ in their number of dimensions, the shape
of the one with fewer dimensions is padded with ones on its
leading (left) side.
• Rule 2: If the shape of the two arrays does not match in any dimension, the
array with shape equal to 1 in that dimension is stretched to match
the other shape.
• Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error
is raised
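The three rules can be observed on small arrays. In the sketch below the names a, b and M are illustrative; the shapes in the comments show how each rule applies.

```python
import numpy as np

a = np.arange(3)                  # shape (3,)
b = np.arange(3)[:, np.newaxis]   # shape (3, 1)
M = np.ones((3, 2))               # shape (3, 2)

# Rule 1 pads a to (1, 3); rule 2 stretches both operands to (3, 3).
grid = a + b

try:
    M + a                         # (3, 2) against (1, 3): 2 and 3 disagree (rule 3)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(grid)
print(mismatch_raised)
```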

Course Faculty

Verified by HoD

LECTURE HANDOUTS L30

CSE II/III
Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : IV – Python Libraries for Data Wrangling Date of Lecture:

Topic of Lecture: Comparisons, Masks, and Boolean Logic


Introduction :

Comparison Operators as ufuncs.


We saw that using +, -, *, /, and others on arrays leads to element-wise operations. NumPy also
implements comparison operators such as < (less than) and > (greater than) as element-wise
ufuncs.
The result of these comparison operators is always an array with a Boolean data type.

Prerequisite knowledge for Complete understanding and learning of Topic:

 Mathematics
 Probability
 Python
Detailed content of the Lecture:

x = np.array([1, 2, 3, 4, 5])
x < 3 # less than

array([ True, True, False, False, False], dtype=bool)

x > 3 # greater than
array([False, False, False, True, True], dtype=bool)

x <= 3 # less than or equal

array([ True, True, True, False, False], dtype=bool)

x >= 3 # greater than or equal

array([False, False, True, True, True], dtype=bool)


x != 3 # not equal

array([ True, True, False, True, True], dtype=bool)

x == 3 # equal
array([False, False, True, False, False], dtype=bool)

Operator   Equivalent ufunc
==         np.equal
!=         np.not_equal
<          np.less
<=         np.less_equal
>          np.greater
>=         np.greater_equal

Just as in the case of arithmetic ufuncs, these will work on arrays of any size and shape.
Here is a two-dimensional example:

rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x
array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])

Working with Boolean Arrays


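• np.count_nonzero() – counts the True entries in a Boolean array
• np.sum() – also counts True entries, since True is treated as 1
• np.sum(x, axis) – counts True entries along rows or columns
• np.any() – checks whether any value is True
• np.all() – checks whether all values are True
• np.all(x, axis) – checks whether all values are True along rows or columns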
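A short sketch of these functions applied to the x array from this lecture follows. The comparison thresholds (6, 8, 10) and the result names are chosen here for illustration.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

below_six = np.count_nonzero(x < 6)       # how many values are less than 6
below_six_sum = np.sum(x < 6)             # same count, True counted as 1
per_row = np.sum(x < 6, axis=1)           # count in each row
any_above_eight = np.any(x > 8)           # is any value greater than 8?
all_below_ten = np.all(x < 10)            # are all values less than 10?
rows_below_eight = np.all(x < 8, axis=1)  # row by row: all values below 8?
```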

Boolean operators
Operator   Equivalent ufunc
&          np.bitwise_and
|          np.bitwise_or
^          np.bitwise_xor
~          np.bitwise_not

Boolean Arrays as Masks


A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the
data themselves. Returning to our x array from before, suppose we want an array of all values
in the array that are less than, say, 5.
We can obtain a Boolean array for this condition easily, as we've already seen.
Example
x
array([[5, 0, 3, 3],
       [7, 9, 3, 5],
       [2, 4, 7, 6]])

x < 5
array([[False, True, True, True],
       [False, False, True, False],
       [ True, True, False, False]], dtype=bool)

Masking operation
To select these values from the array, we can simply index on this Boolean array; this is known as
a masking operation.
x[x < 5]
array([0, 3, 3, 3, 2, 4])
What is returned is a one-dimensional array filled with all the values that meet this condition;
in other words, all the values in positions at which the mask array is True.
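Masks become more useful when conditions are combined with the Boolean operators from the table above. The sketch below, using the same x array, selects values with a single condition and with a compound one; the bounds 2 and 7 are arbitrary illustrative choices.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

small = x[x < 5]                   # single condition
in_range = x[(x > 2) & (x < 7)]    # both conditions must hold

print(small)
print(in_range)
```

The parentheses around each comparison are needed because & binds more tightly than < and >.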


Course Faculty

Verified by HoD

LECTURE HANDOUTS L31

CSE II/III
Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : IV – Python Libraries for Data Wrangling Date of Lecture:

Topic of Lecture: Fancy Indexing, Structured Arrays


Introduction :
Fancy indexing is like the simple indexing we’ve already seen, but we
pass arrays of indices in place of single scalars. This allows us to very quickly
access and modify complicated subsets of an array’s values.
Prerequisite knowledge for Complete understanding and learning of Topic:

 Data structure
 Python
 Database
Detailed content of the Lecture:

Types of fancy indexing.


• Indexing / accessing more values
• Array of indices
• In multiple dimensions
• Standard indexing
Example

import numpy as np
rand = np.random.RandomState(42)
x = rand.randint(100, size=10)
print(x)
[51 92 14 71 60 20 82 86 74 74]

Indexing / accessing more values


Suppose we want to access three different elements. We could do it like this:
[x[3], x[7], x[2]]

[71, 86, 14]

Array of indices
Alternatively, we can pass a single list or array of indices to select several
elements in one step.
ind = [3, 7, 4]
x[ind]

array([71, 86, 60])

Multi dimensional
Fancy indexing also works in multiple dimensions. Consider the following array.
X = np.arange(12).reshape((3, 4))
X
array([[ 0, 1, 2, 3],
       [ 4, 5, 6, 7],
       [ 8, 9, 10, 11]])

Standard indexing
Like with standard indexing, the first index refers to the row,
and the second to the column.
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
X[row, col]
array([ 2, 5, 11])

Combined Indexing
For even more powerful operations, fancy indexing can be combined with the
other indexing schemes we’ve seen.
Example array
print(X)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
• Combine fancy and simple indices
X[2, [2, 0, 1]]
array([10, 8, 9])
• Combine fancy indexing with masking
mask = np.array([1, 0, 1, 0], dtype=bool)
X[row[:, np.newaxis], mask]

array([[ 0, 2],
       [ 4, 6],
       [ 8, 10]])
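The difference between pairing the index arrays and broadcasting them is easy to miss, so here is a small sketch using the X, row and col arrays of this section. The names paired and table are illustrative; the last line shows that fancy indexing can also be used to modify values.

```python
import numpy as np

X = np.arange(12).reshape((3, 4))
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])

paired = X[row, col]                  # elements (0, 2), (1, 1) and (2, 3)
table = X[row[:, np.newaxis], col]    # indices broadcast to a 3x3 result
X[row, col] = 0                       # assign through the same index arrays

print(paired)
print(table)
print(X)
```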

SORTING ARRAYS
Although Python has built-in sort and sorted functions to work with lists, we
won't discuss them here because NumPy's np.sort function turns out to be much
more efficient and useful for our purposes. By default np.sort uses an
O(N log N) quicksort algorithm, though mergesort and heapsort are also available.
For most applications, the default quicksort is more than sufficient.

Sorting without modifying the input.


To return a sorted version of the array without modifying the input, you can use
np.sort.
x = np.array([2, 1, 4, 3, 5])
np.sort(x)
array([1, 2, 3, 4, 5])

Returns sorted indices


A related function is argsort, which instead returns the indices of the sorted
elements.
x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)
print(i)
[1 0 3 2 4]

Sorting along rows or columns


A useful feature of NumPy's sorting algorithms is the ability to sort along specific
rows or columns of a multidimensional array using the axis argument. For
example:
rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
print(X)
[[6 3 7 4 6 9]
 [2 6 7 4 3 7]
 [7 2 5 4 1 7]
 [5 1 4 0 9 5]]
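The effect of the axis argument can be seen by sorting the array printed above in both directions. In this sketch X is typed in by hand so that the result does not depend on the random number generator.

```python
import numpy as np

X = np.array([[6, 3, 7, 4, 6, 9],
              [2, 6, 7, 4, 3, 7],
              [7, 2, 5, 4, 1, 7],
              [5, 1, 4, 0, 9, 5]])

by_column = np.sort(X, axis=0)   # sort each column independently
by_row = np.sort(X, axis=1)      # sort each row independently

print(by_column)
print(by_row)
```

Each row or column is treated as an independent array, so any relationship between the values in a row or column is lost.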

Partial Sorts: Partitioning


Sometimes we're not interested in sorting the entire array, but simply want
to find the K smallest values in the array. NumPy provides this in the
np.partition function. np.partition takes an array and a number K; the result is
a new array with the smallest K values to the left of the partition, and the
remaining values to the right, in arbitrary order.
x = np.array([7, 2, 3, 1, 6, 5, 4])
np.partition(x, 3)

array([2, 1, 3, 4, 6, 5, 7])
Note that the first three values in the resulting array are the three smallest in
the array, and the remaining array positions contain the remaining values.
Within the two partitions, the elements have arbitrary order.

Partitioning in multidimensional array


Similarly to sorting, we can partition along an arbitrary axis of a multidimensional
array.
np.partition(X, 2, axis=1)

array([[3, 4, 6, 7, 6, 9],
[2, 3, 4, 7, 6, 7],
[1, 2, 4, 5, 7, 7],
[0, 1, 4, 5, 9, 5]])

Structured Arrays
This section demonstrates the use of NumPy’s structured arrays and record
arrays, which provide efficient storage for compound, heterogeneous data.
NumPy data types
Character   Description              Example
'b'         Byte                     np.dtype('b')
'i'         Signed integer           np.dtype('i4') == np.int32
'u'         Unsigned integer         np.dtype('u1') == np.uint8
'f'         Floating point           np.dtype('f8') == np.float64
'c'         Complex floating point   np.dtype('c16') == np.complex128
'S', 'a'    String                   np.dtype('S5')
'U'         Unicode string           np.dtype('U') == np.str_
'V'         Raw data (void)          np.dtype('V') == np.void

Creating structured array


NumPy can handle compound, heterogeneous data through structured arrays, which
are arrays with compound data types. We can create a structured array using a
compound data type specification as follows.
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                          'formats':('U10', 'i4', 'f8')})
print(data.dtype)

[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
U10 - Unicode string of maximum length 10
i4 - 4-byte (i.e., 32 bit) integer
f8 - 8-byte (i.e., 64 bit) float
Now we can fill the array with our lists of values (name, age, and weight):
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
[('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0) ('Doug', 19, 61.5)]

Refer values through index or name


The handy thing with structured arrays is that you can now refer to values either
by index or by name.
i. data['name'] # by name

array(['Alice', 'Bob', 'Cathy', 'Doug'], dtype='<U10')

ii. data[0] # by index

('Alice', 25, 55.0)

Using Boolean masking


This allows us to do some more sophisticated operations, such as filtering on
any field.
data[data['age'] < 30]['name']
array(['Alice', 'Doug'], dtype='<U10')
Creating Structured Arrays
Dictionary method
np.dtype({'names':('name', 'age', 'weight'),
          'formats':('U10', 'i4', 'f8')})

dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])
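Putting the steps of this section together gives the following runnable sketch. The three input lists are reconstructed from the printed output above; the names young and last_name are illustrative.

```python
import numpy as np

# Input lists matching the values printed in this section
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})
data['name'] = name
data['age'] = age
data['weight'] = weight

young = data[data['age'] < 30]['name']   # Boolean mask, then field access
last_name = data[-1]['name']             # index first, then field access

print(data)
print(young, last_name)
```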


Course Faculty

Verified by HoD
LECTURE HANDOUTS L32

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : IV – Python Libraries for Data Wrangling Date of Lecture:

Topic of Lecture: Data Manipulation with Pandas


Introduction :
Pandas is a newer package built on top of NumPy, and provides an
efficient implementation of a DataFrame. DataFrames are essentially
multidimensional arrays with attached row and column labels, and often with
heterogeneous types and/or missing data.

Prerequisite knowledge for Complete understanding and learning of Topic:

 Python
 Mathematics
 Database
Detailed content of the Lecture:

Introducing Pandas Objects


Pandas objects can be thought of as enhanced versions of NumPy structured
arrays in which the rows and columns are identified with labels rather than
simple integer indices.
Pandas provides a host of useful tools, methods, and functionality
on top of the basic data structures. There are three fundamental Pandas
data structures: the Series, the DataFrame, and the Index.

The Pandas Series Object


A Pandas Series is a one-dimensional array of indexed data. It can be created
from a list or array as follows:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
• Finding values
The values are simply a familiar NumPy array.
data.values
array([ 0.25, 0.5 , 0.75, 1. ])
• Finding index
The index is an array-like object of type pd.Index.
data.index
Series as generalized NumPy array
Whereas the NumPy array has an implicitly defined integer index used to access
the values, the Pandas Series has an explicitly defined index associated with
the values.
This explicit index definition gives the Series object additional capabilities.
For example, the index need not be an integer, but can consist of values of
any desired type.
For example, if we wish, we can use strings as an index.

Strings as an index
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64

Series as specialized dictionary


A dictionary is a structure that maps arbitrary keys to a set of arbitrary values,
and a Series is a structure that maps typed keys to a set of typed values.
Just as the type-specific compiled code behind a NumPy array makes it more
efficient than a Python list for certain operations, the type information of a Pandas
Series makes it much more efficient than Python dictionaries for certain
operations.

Dictionary-style item access


Mark['ram']
85
Array-style slicing
Mark['sai':'kasim']
sai 90
ram 85
kasim 92
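The Mark object used above is not defined in the handout. The sketch below assumes it is a Series built from a dictionary of marks, with the same values that appear later for subject 1; that construction is an assumption made for illustration.

```python
import pandas as pd

# Assumed definition of Mark: marks keyed by student name
Mark = pd.Series({'sai': 90, 'ram': 85, 'kasim': 92, 'tamil': 89})

one = Mark['ram']              # dictionary-style item access
some = Mark['sai':'kasim']     # array-style slicing by label
has_tamil = 'tamil' in Mark    # membership test on the index, as with a dict

print(one)
print(some)
```

Note that a label-based slice includes its end point, so 'kasim' is part of the result.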

The Pandas Data Frame Object


The fundamental structure in Pandas is the DataFrame. The DataFrame can
be thought of either as a generalization of a NumPy array, or as a
specialization of a Python dictionary.

DataFrame as a generalized NumPy array


A DataFrame is an analog of a two-dimensional array with both flexible row
indices and flexible column names. Just as you might think of a two-dimensional
array as an ordered sequence of aligned one-dimensional
columns, you can think of a DataFrame as a sequence of aligned Series
objects. Here, by "aligned" we mean that they share the same index.
To demonstrate this, let's first construct a new Series listing the marks of
subject 2.
sub2 = pd.Series({'sai':91,'ram':95,'kasim':89,'tamil':90})

We can use a dictionary to construct a single two-dimensional object containing
this information.
result = pd.DataFrame({'DS':sub1,'FDS':sub2})
result

       DS  FDS
sai    90   91
ram    85   95
kasim  92   89
tamil  89   90
DataFrame has an index attribute


Like the Series object, the DataFrame has an index attribute that gives access to
the index labels.
result.index

Index(['sai', 'ram', 'kasim', 'tamil'], dtype='object')

DataFrame has a columns attribute.


The DataFrame has a columns attribute, which is an Index object holding the
column labels.
result.columns

Index(['DS', 'FDS'], dtype='object')

DataFrame as specialized dictionary


We can also think of a DataFrame as a specialization of a dictionary. Where a
dictionary maps a key to a value, a DataFrame maps a column name to a
Series of column data.
result['DS']
sai 90
ram 85
kasim 92
tamil 89
Name: DS, dtype: int64

Constructing DataFrame objects


A Pandas DataFrame can be constructed in a variety of ways. Here we’ll give
several examples.
 From a single Series object.
 From a list of dicts.
 From a dictionary of Series objects.
 From a two-dimensional NumPy array.
 From a NumPy structured array.
From a single Series object.

A DataFrame is a collection of Series objects, and a single-column DataFrame
can be constructed from a single Series.
sub1 = pd.Series({'sai':90,'ram':85,'kasim':92,'tamil':89})
pd.DataFrame(sub1, columns=['DS'])
       DS
sai    90
ram    85
kasim  92
tamil  89
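Some of the other construction routes listed above can be sketched in the same way. The marks are the ones used in this lecture; the list of dicts, the array contents and the labels 'foo', 'bar', 'a', 'b', 'c' are illustrative values.

```python
import numpy as np
import pandas as pd

sub1 = pd.Series({'sai': 90, 'ram': 85, 'kasim': 92, 'tamil': 89})
sub2 = pd.Series({'sai': 91, 'ram': 95, 'kasim': 89, 'tamil': 90})

# From a dictionary of Series objects
from_series_dict = pd.DataFrame({'DS': sub1, 'FDS': sub2})

# From a list of dicts (each dict becomes one row)
from_dict_list = pd.DataFrame([{'a': i, 'b': 2 * i} for i in range(3)])

# From a two-dimensional NumPy array with explicit labels
from_array = pd.DataFrame(np.arange(6).reshape(3, 2),
                          columns=['foo', 'bar'],
                          index=['a', 'b', 'c'])

print(from_series_dict)
print(from_dict_list)
print(from_array)
```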
The Pandas Index Object
We have seen here that both the Series and DataFrame objects contain an
explicit index that lets you reference and modify data. This Index object is an
interesting structure in itself, and it can be thought of either as an immutable
array or as an ordered set.
ind = pd.Index([2, 3, 5, 7, 11])
ind
Index([2, 3, 5, 7, 11], dtype='int64')
(Older Pandas versions display this object as Int64Index.)

 Index as immutable array


The Index object in many ways operates like an array. For
example, we can use standard Python indexing notation
to retrieve values or slices.

 Index as ordered set


Pandas objects are designed to facilitate operations such as joins across
datasets, which depend on many aspects of set arithmetic.
The Index object follows many of the conventions used by Python's built-in
set data structure, so that unions, intersections, differences, and other
combinations can be computed in a familiar way. Older Pandas versions accepted
the & and | operators for these set operations; current versions provide them
through the intersection() and union() methods.
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA.intersection(indB) # intersection
Index([3, 5, 7], dtype='int64')
indA.union(indB) # union
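Both views of the Index, as an immutable array and as an ordered set, can be checked with a few lines. This is a sketch using the indA and indB objects from this section; the result names are illustrative.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)             # labels present in both
either = indA.union(indB)                    # labels present in at least one
only_one = indA.symmetric_difference(indB)   # labels present in exactly one

try:
    indA[1] = 0                              # an Index cannot be modified in place
    immutable = False
except TypeError:
    immutable = True

print(list(common), list(either), list(only_one), immutable)
```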

Course Faculty

Verified by HoD

LECTURE HANDOUTS L33

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : IV – Python Libraries for Data Wrangling Date of Lecture:


Topic of Lecture: Data Indexing and Selection
Introduction :
A Series object acts in many ways like a one dimensional NumPy array,
and in many ways like a standard Python dictionary. It will help us to
understand the patterns of data indexing and selection in these arrays.
Prerequisite knowledge for Complete understanding and learning of Topic:

 Database
 Python
 Mathematics
Detailed content of the Lecture:

• Series as dictionary
Like a dictionary, the Series object provides a mapping from a collection
of keys to a collection of values.
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
a 0.25
b 0.50
c 0.75
d 1.00
dtype: float64

data['b']
0.5

Examine the keys/indices and values


We can also use dictionary-like Python expressions and methods to examine the
keys/indices and values.
i. data.keys()
Index(['a', 'b', 'c', 'd'], dtype='object')
ii. list(data.items())
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

Modifying series object


Series objects can even be modified with a dictionary-like syntax. Just as you
can extend a dictionary by assigning to a new key, you can extend a Series by
assigning to a new index value.
data['e'] = 1.25
data
a 0.25
b 0.50
c 0.75
d 1.00
e 1.25
dtype: float64
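The dictionary-like behaviour described in this part can be run as one short sketch. The data Series is the one from this lecture; the names has_a, keys and pairs are illustrative.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

has_a = 'a' in data           # membership test works on the index
keys = list(data.keys())      # the index labels
pairs = list(data.items())    # (label, value) pairs

data['e'] = 1.25              # assigning to a new label extends the Series

print(has_a, keys)
print(data)
```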

 Series as one-dimensional array


A Series builds on this dictionary-like interface and provides array-style item
selection via the same basic mechanisms as NumPy arrays—that is, slices,
masking, and fancy indexing

Slicing by explicit index


data['a':'c']

a 0.25
b 0.50
c 0.75
dtype: float64

Slicing by implicit integer index


data[0:2]
a 0.25
b 0.50
dtype: float64

Masking
data[(data > 0.3) & (data < 0.8)]

b 0.50
c 0.75
dtype: float64

• Indexers: loc, iloc, and ix


Pandas provides some special indexer attributes that explicitly expose
certain indexing schemes. These are not functional methods, but
attributes that expose a particular slicing interface to the data in the
Series.
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
1 a
3 b
5 c
dtype: object

loc - the loc attribute allows indexing and slicing that always references the
explicit index.
data.loc[1:3]
1 a
3 b
dtype: object
iloc - the iloc attribute allows indexing and slicing that always references the
implicit Python-style index.
data.iloc[1:3]
3 b
5 c
dtype: object

ix - ix was a hybrid of the two, and for Series objects was equivalent to standard
[ ]-based indexing. It was deprecated and has been removed from current versions
of Pandas, so loc and iloc should be used instead.
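With an integer index such as [1, 3, 5], the explicit and implicit schemes give different answers for the same number, which is exactly the confusion loc and iloc are meant to remove. A minimal sketch, with illustrative result names:

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = data.loc[1]            # explicit index: the label 1
by_position = data.iloc[1]        # implicit index: the second element
label_slice = data.loc[1:3]       # labels 1 through 3, end point included
position_slice = data.iloc[1:3]   # positions 1 and 2, end point excluded

print(by_label, by_position)
print(label_slice)
print(position_slice)
```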
Data Selection in DataFrame
 DataFrame as a dictionary
 DataFrame as two-dimensional array
 Additional indexing conventions
DataFrame as a dictionary

The first analogy we will consider is the DataFrame as a dictionary of related
Series objects.
The individual Series that make up the columns of the DataFrame can be
accessed via dictionary-style indexing of the column name.
Dictionary-style indexing of the column name
result = pd.DataFrame({'DS':sub1,'FDS':sub2})
result['DS']
sai 90
ram 85
kasim 92
tamil 89
Name: DS, dtype: int64

Attribute-style access with column names that are strings


result.DS
sai 90
ram 85
kasim 92
tamil 89
Name: DS, dtype: int64

Comparing attribute style and dictionary style accesses


result.DS is result['DS']
Modify the object
Like with the Series objects, this dictionary-style syntax can also be used to
modify the object, in this case to add a new column:
result['TOTAL'] = result['DS'] + result['FDS']
result
       DS  FDS  TOTAL
sai    90   91    181
ram    85   95    180
kasim  92   89    181
tamil  89   90    179
We can also transpose the full DataFrame to swap rows and columns.
result.T

 Masking and Fancy indexing


In the loc indexer we can combine masking and fancy indexing as in the
following:
result.loc[result.TOTAL > 180, ['DS', 'FDS']]
       DS  FDS
sai    90   91
kasim  92   89

• Modifying values
Indexing conventions may also be used to set or modify values; this
is done in the standard way that you might be accustomed to from
working with NumPy.
result.iloc[1, 1] = 70
result
       DS  FDS  TOTAL
sai    90   91    181
ram    85   70    180
kasim  92   89    181
tamil  89   90    179

Slicing row wise

result['sai':'kasim']

Masking row wise


result[result.TOTAL > 180]

DS FDS TOTAL
sai 90 91 181
kasim 92 89 181
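The selection steps of this lecture can be run end to end as follows. The marks are the ones used above; the names top and rows are illustrative, and loc is used for the row slice to make the label-based intent explicit.

```python
import pandas as pd

sub1 = pd.Series({'sai': 90, 'ram': 85, 'kasim': 92, 'tamil': 89})
sub2 = pd.Series({'sai': 91, 'ram': 95, 'kasim': 89, 'tamil': 90})
result = pd.DataFrame({'DS': sub1, 'FDS': sub2})

result['TOTAL'] = result['DS'] + result['FDS']            # add a new column
top = result.loc[result['TOTAL'] > 180, ['DS', 'FDS']]    # mask rows, pick columns
result.iloc[1, 1] = 70                                    # modify by integer position
rows = result.loc['sai':'kasim']                          # slice rows by label

print(top)
print(result)
```

Changing the FDS mark for 'ram' does not update TOTAL, because TOTAL was computed once when the column was created.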

Course Faculty

Verified by HoD

LECTURE HANDOUTS L34

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : IV – Python Libraries for Data Wrangling Date of Lecture:


Topic of Lecture: Operating on Data, Missing data
Introduction :
Pandas inherits much of its functionality from NumPy and its ufuncs. As a result,
Pandas can perform quick element-wise operations, both with basic
arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated
operations (trigonometric functions, exponential and logarithmic functions,
etc.). For unary operations like negation and trigonometric functions, these ufuncs
will preserve index and column labels in the output.
Prerequisite knowledge for Complete understanding and learning of Topic:
 Database
 Python
 Mathematics
Detailed content of the Lecture:

For binary operations such as addition and multiplication, Pandas will
automatically align indices when passing the objects to the ufunc.
Here we are going to see how universal functions work on Series and
DataFrames through
• Index preservation
• Index alignment

Index Preservation
Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas
Series and DataFrame objects. We can use all the arithmetic and special universal
functions of NumPy on Pandas objects. In the outputs the index is preserved
(maintained), as shown below.
For Series
ser = pd.Series([9, 4, 6, 3])
ser
0 9
1 4
2 6
3 3
dtype: int64

For DataFrame
df = pd.DataFrame(np.random.randint(0, 10, (3, 4)),
                  columns=['a', 'b', 'c', 'd'])
df

   a  b  c  d
0  1  4  1  4
1  8  4  0  4
2  7  7  7  2

For universal functions (here we use the exponential as an example):

Ufuncs for Series

np.exp(ser)

0 8103.083928
1 54.598150
2 403.428793
3 20.085537
dtype: float64
Ufuncs for DataFrame
np.exp(df)
Index Alignment
Pandas will align indices in the process of performing the operation. This is
very convenient when you are working with incomplete data, as we'll see in the
examples that follow.

Index alignment in Series


Suppose we are combining two different data sources; the indices will then be
aligned accordingly.
x = pd.Series([2, 4, 6], index=[1, 3, 5])
y = pd.Series([1, 3, 5, 7], index=[1, 2, 3, 4])
x + y
1 3.0
2 NaN
3 9.0
4 NaN
5 NaN
dtype: float64
The resulting Series contains the union of the indices of the two input Series,
which we could determine using standard Python set arithmetic on these
indices.
Any item for which one or the other does not have an entry is marked with
NaN, or "Not a Number," which is how Pandas marks missing data.

Fill value in missing data (fill_value)


If using NaN values is not the desired behavior, we can modify the fill value
using appropriate object methods in place of the operators.
x.add(y, fill_value=0)

1 3.0
2 3.0
3 9.0
4 7.0
5 6.0
dtype: float64
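The two results above can be reproduced and compared in a short sketch. The names aligned and filled are illustrative.

```python
import pandas as pd

x = pd.Series([2, 4, 6], index=[1, 3, 5])
y = pd.Series([1, 3, 5, 7], index=[1, 2, 3, 4])

aligned = x + y                  # labels missing on either side give NaN
filled = x.add(y, fill_value=0)  # a missing entry is treated as 0 before adding

print(aligned)
print(filled)
```

With fill_value=0, a label that exists in only one Series keeps that Series' own value, as at labels 2, 4 and 5.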

Operations between Data Frame and Series


When you are performing operations between a DataFrame and a Series, the
index and column alignment is similarly maintained. Operations between a
DataFrame and a Series are similar to operations between a two-dimensional
and a one-dimensional NumPy array.
A = np.random.randint(10, size=(3, 4))
A
array([[3, 8, 2, 4],
       [2, 6, 4, 8],
       [6, 1, 3, 8]])

A - A[0]
array([[ 0, 0, 0, 0],
       [-1, -2, 2, 4],
       [ 3, -7, 1, 4]])
Handling Missing Data
A number of schemes have been developed to indicate the presence of
missing data in a table or DataFrame. Generally, they revolve around one of
two strategies: using a mask that globally indicates missing values, or
choosing a sentinel value that indicates a missing entry.
In the masking approach, the mask might be an entirely separate Boolean
array, or it may involve appropriation of one bit in the data representation to
locally indicate the null status of a value.

In the sentinel approach, the sentinel value could be some data-specific
convention, such as indicating a missing integer value with –9999 or some
rare bit pattern, or it could be a more global convention, such as indicating a
missing floating-point value with NaN (Not a Number), a special value which
is part of the IEEE floating-point specification.

Missing Data in Pandas


The way in which Pandas handles missing values is constrained by its reliance on
the NumPy package, which does not have a built-in notion of NA values for
non-floating-point data types.
NumPy supports fourteen basic integer types once you account for available
precisions, signedness, and endianness of the encoding. Reserving a specific
bit pattern in all available NumPy types would lead to an unwieldy amount of
overhead in special-casing various operations for various types, likely even
requiring a new fork of the NumPy package.
Pandas chose to use sentinels for missing data, and further chose to use two
already-existing Python null values: the special floating-point NaN value, and
the Python None object. This choice has some side effects, as we will see, but
in practice ends up being a good compromise in most cases of interest.

None: Pythonic missing data


The first sentinel value used by Pandas is None, a Python singleton object.
Because None is a Python object, it can only be used in arrays with data type
'object'. For example, np.array([1, None, 3, 4]) is reported with dtype=object.
This dtype=object means that the best common type representation NumPy
could infer for the contents of the array is that they are Python objects.
NaN: Missing numerical data
NaN is a special floating-point value recognized by all systems that use the
standard IEEE floating-point representation.
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype

dtype('float64')
You should be aware that NaN is a bit like a data virus: it infects any other
object it touches. Regardless of the operation, the result of arithmetic with
NaN will be another NaN.
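The difference between the two sentinels can be checked directly. This is a small sketch; the use of np.nansum as a NaN-aware aggregation is an addition that the handout does not cover.

```python
import numpy as np

vals1 = np.array([1, None, 3, 4])    # None forces dtype=object
vals2 = np.array([1, np.nan, 3, 4])  # NaN keeps a native floating-point dtype

print(vals1.dtype)
print(vals2.dtype)

# Arithmetic with NaN propagates NaN, including through aggregations.
print(1 + np.nan, 0 * np.nan)
plain_total = vals2.sum()

# NaN-aware aggregations skip the missing entries instead.
total = np.nansum(vals2)
print(plain_total, total)
```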

Pandas handling of NAs by type


Typeclass   Conversion when storing NAs   NA sentinel value
floating    No change                     np.nan
object      No change                     None or np.nan
integer     Cast to float64               np.nan
boolean     Cast to object                None or np.nan

Note : In Pandas, string data is always stored with an object dtype.

Operating on Null Values


There are several useful methods for detecting, removing, and replacing null
values in Pandas data structures. They are:
 isnull() - Generate a Boolean mask indicating missing values
 notnull() - Opposite of isnull()
 dropna() - Return a filtered version of the data
 fillna() - Return a copy of the data with missing values filled or imputed
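The four methods listed above can be tried on a small Series. This is a minimal sketch; the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([1.0, np.nan, 2.5, None])   # None is stored as NaN here

mask = data.isnull()             # True where the value is missing
present = data[data.notnull()]   # Boolean mask used as an index
dropped = data.dropna()          # same result via the convenience method
filled = data.fillna(0)          # copy with missing entries replaced by 0

print(mask)
print(dropped)
print(filled)
```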

Drop values in column or row


We can drop NA values along a different axis; axis=1 drops all columns containing
a null value.

all null values


You can also specify how='all', which will only drop rows/columns that are all null
values.

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN

Specific no of null values (thresh)


The thresh parameter lets you specify a minimum number of non-null values for
the row/column to be kept.
df.dropna(axis='rows', thresh=3)
     0    1  2   3
1  2.0  3.0  5 NaN
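The dropna() options described in this section can be compared side by side. The sketch below rebuilds the DataFrame shown above; the variable names are assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan   # a column that is entirely null

rows_clean = df.dropna()                            # drop rows with any null
cols_clean = df.dropna(axis='columns')              # drop columns with any null
cols_all = df.dropna(axis='columns', how='all')     # drop only all-null columns
rows_thresh = df.dropna(axis='rows', thresh=3)      # keep rows with >= 3 non-null values

print(cols_all)
print(rows_thresh)
```

Because every row contains a null in column 3, the default call leaves no rows at all, which is why how and thresh are useful.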

Filling null values


Sometimes rather than dropping NA values, you’d rather replace them with a
valid value. This value might be a single number like zero, or it might be
some sort of imputation or interpolation from the good values. You could do
this in-place using the isnull() method as a mask, but because it is such a
common operation Pandas provides the fillna() method, which returns a copy
of the array with the null values replaced.
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

Fill with single value


We can fill NA entries with a single value, such as zero.
data.fillna(0)
a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
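Besides a constant, the missing entries can be filled from neighbouring or summary values. This is a sketch only: ffill(), bfill() and mean imputation are not covered in the handout and are shown here as common choices.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

zero_filled = data.fillna(0)             # constant fill
forward = data.ffill()                   # propagate the previous valid value
backward = data.bfill()                  # propagate the next valid value
mean_filled = data.fillna(data.mean())   # simple imputation with the mean

print(forward)
print(backward)
print(mean_filled)
```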
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
 Data Science from Scratch, author: Joel Grus, page nos.: 403
 Essential Math for Data Science, author: Thomas Nield, page nos.: 347
 Becoming a Data Head, author: Alex Gutman, page nos.: 272

Course Faculty

Verified by HoD

LECTURE HANDOUTS L35

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : IV – Python Libraries for Data Wrangling Date of Lecture:


Topic of Lecture: Hierarchical Indexing, Combining Data sets
Introduction :
Up to this point we’ve been focused primarily on one-dimensional and
two-dimensional data, stored in Pandas Series and DataFrame objects,
respectively. Often it is useful to go beyond this and store higher-dimensional
data, that is, data indexed by more than one or two keys.
Prerequisite knowledge for Complete understanding and learning of Topic:

 OOP language
 Python
 Database
Detailed content of the Lecture:


Here we’ll explore the direct creation of MultiIndex objects; considerations
around indexing, slicing, and computing statistics across multiply indexed
data; and useful routines for converting between simple and hierarchically
indexed representations of your data.

A Multiply Indexed Series

Pandas MultiIndex
Indexing a Series with Python tuples as keys is essentially a rudimentary
multi-index. Pandas provides a better way: the Pandas MultiIndex type gives us
the type of operations we wish to have. We start from the tuple-based
representation, from which a multi-index can then be created:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

Hierarchical representation of the data


index = pd.MultiIndex.from_tuples(index)
pop = pop.reindex(index)
pop
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
Here the first two columns of the Series representation show the multiple
index values, while the third column shows the data.

Access all data with second index


pop[:, 2010]
California    37253956
New York      19378102
Texas         25145561
dtype: int64
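The selection above can be reproduced end to end. The following is a minimal sketch using the population figures from this handout; the level names state and year, and the use of .loc in place of plain brackets, are assumptions made for clarity.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)],
    names=['state', 'year'])   # level names are assumptions
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)

all_2010 = pop.loc[:, 2010]           # every state, second level equal to 2010
california = pop.loc['California']    # partial indexing on the first level
one_value = pop.loc[('Texas', 2000)]  # both levels give a single element

print(all_2010)
print(california)
print(one_value)
```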

MultiIndex as extra dimension


We could easily have stored the same data using a simple DataFrame with
index and column labels. The unstack() method will quickly convert a multiply
indexed Series into a conventionally indexed DataFrame.
pop_df = pop.unstack()
pop_df

Add a new column in multi dimensional data frame.


pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df

Universal functions
All the ufuncs and other functionality work with hierarchical indices.
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568
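The steps of this section can be run together as one script. This is a sketch under the assumption that the index is built with MultiIndex.from_product, a constructor the handout does not mention; the data values are the ones used above.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]])
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)

wide = pop.unstack()    # states as rows, years as columns
back = wide.stack()     # stack() reverses unstack()

pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094, 4687374,
                                   4318033, 5906301, 6879014]})
# Element-wise division works on the multiply indexed columns as usual.
f_u18 = (pop_df['under18'] / pop_df['total']).unstack()
print(f_u18.round(6))
```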
Methods of Multi Index Creation
The most straightforward way to construct a multiply indexed Series or
DataFrame is to simply pass a list of two or more index arrays to the
constructor.
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df
        data1     data2
a 1  0.554233  0.356072
  2  0.925244  0.219474
b 1  0.441759  0.610054
  2  0.171495  0.886688
Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas
will automatically recognize this and use a MultiIndex by default.

data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)
 Stacking and unstacking indices


It is possible to convert a dataset from a stacked multi-index to a simple
two-dimensional representation, optionally specifying the level to use.
pop.unstack(level=0)
state  California  New York     Texas
year
2000     33871648  18976457  20851820
2010     37253956  19378102  25145561
 Index setting and resetting
Another way to rearrange hierarchical data is to turn the index labels into
columns; this can be accomplished with the reset_index method. Calling this
on the population Series (with its index levels named state and year) will
result in a DataFrame with a state and year column holding the information
that was formerly in the index. For clarity, we can optionally specify the
name of the data for the column representation.
pop_flat = pop.reset_index(name='population')
pop_flat
        state  year  population
0  California  2000    33871648
1  California  2010    37253956
2    New York  2000    18976457
3    New York  2010    19378102
4       Texas  2000    20851820
5       Texas  2010    25145561
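Resetting the index can be reversed with set_index. The sketch below is self-contained; set_index is not discussed in the handout, and the level names are set explicitly because reset_index takes the column names from them.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]],
                                   names=['state', 'year'])
pop = pd.Series([33871648, 37253956, 18976457,
                 19378102, 20851820, 25145561], index=index)

pop_flat = pop.reset_index(name='population')     # index levels become columns
restored = pop_flat.set_index(['state', 'year'])  # columns become index levels again

print(pop_flat)
print(restored)
```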

Calculate the average as follows


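# health_data: rows indexed by (year, visit), columns indexed by (subject, type)
data_mean = health_data.mean(level='year')  # pandas >= 2.0: health_data.groupby(level='year').mean()
data_mean

subject   Bob        Guido          Sue
type       HR  Temp     HR   Temp    HR   Temp
year
2013     37.5  38.2   41.0  35.85  32.0  36.95
2014     38.5  37.6   43.5  37.55  56.0  36.70
By further making use of the axis keyword, we can take the mean among levels
on the columns as well:
data_mean.mean(axis=1, level='type')  # pandas >= 2.0: data_mean.T.groupby(level='type').mean().T
type         HR       Temp
year
2013  36.833333  37.000000
2014  46.000000  37.283333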
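Since the level argument of mean() is not available in recent Pandas releases, the same averages can be computed with groupby. The following is a sketch with a small made-up health_data table; the subjects, visits and readings are assumptions and not the data behind the output above.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
values = np.array([[30., 36.0, 40., 37.0],
                   [50., 38.0, 60., 39.0],
                   [32., 36.5, 42., 37.5],
                   [52., 38.5, 62., 39.5]])   # made-up readings
health_data = pd.DataFrame(values, index=index, columns=columns)

# Average the two visits within each year (row level 'year').
data_mean = health_data.groupby(level='year').mean()

# Average over subjects within each measurement type (column level 'type').
type_mean = data_mean.T.groupby(level='type').mean().T

print(data_mean)
print(type_mean)
```

Transposing before and after the second groupby avoids grouping along the column axis directly.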

Video Content / Details of website for further learning:


 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
 Data Science from Scratch, author: Joel Grus, page nos.: 403
 Essential Math for Data Science, author: Thomas Nield, page nos.: 347
 Becoming a Data Head, author: Alex Gutman, page nos.: 272
Course Faculty

Verified by HoD

LECTURE HANDOUTS L36

CSE II/III
Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : IV – Python Libraries for Data Wrangling Date of Lecture:


Topic of Lecture: Aggregation and Grouping, Pivot Tables
Introduction :
Aggregations such as sum(), mean(), median(), min(), and max() condense a
potentially large dataset into a single number that gives insight into its nature.
Prerequisite knowledge for Complete understanding and learning of Topic:
 OOP language
 Python
 Database
Detailed content of the Lecture:
Duplicate indices
One important difference between np.concatenate and pd.concat is that
Pandas concatenation preserves indices, even if the result will have
duplicate indices! Consider this simple example.
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index  # make duplicate indices!
print(x); print(y); print(pd.concat([x, y]))

x              y              pd.concat([x, y])
    A   B          A   B          A   B
0  A0  B0      0  A2  B2      0  A0  B0
1  A1  B1      1  A3  B3      1  A1  B1
                              0  A2  B2
                              1  A3  B3
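The example above relies on a make_df helper that is not defined in this handout. The sketch below supplies a hypothetical version of it and shows three pd.concat options for dealing with duplicate indices (ignore_index, keys and verify_integrity); these options are additions to the handout.

```python
import pandas as pd

def make_df(cols, ind):
    """Hypothetical helper: build a DataFrame whose cells read '<column><row>'."""
    return pd.DataFrame({c: [f"{c}{i}" for i in ind] for c in cols}, index=ind)

x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index                                   # force duplicate index labels

kept = pd.concat([x, y])                            # duplicates are preserved
renumbered = pd.concat([x, y], ignore_index=True)   # fresh integer index
keyed = pd.concat([x, y], keys=['x', 'y'])          # hierarchical index per source

try:
    pd.concat([x, y], verify_integrity=True)
    duplicate_error = None
except ValueError as exc:                           # raised because labels repeat
    duplicate_error = exc

print(kept)
print(renumbered)
print(keyed)
```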

The append() method


Series and DataFrame objects have an append method that can accomplish
the same thing in fewer keystrokes. For example, rather than calling
pd.concat([df1, df2]), you can simply call df1.append(df2). Note that
append was deprecated in pandas 1.4 and removed in pandas 2.0, so with
current versions pd.concat([df1, df2]) is the form to use.
print(df1); print(df2); print(df1.append(df2))

df1            df2            df1.append(df2)
    A   B          A   B          A   B
1  A1  B1      3  A3  B3      1  A1  B1
2  A2  B2      4  A4  B4      2  A2  B2
                              3  A3  B3
                              4  A4  B4

Merge and Join


One essential feature offered by Pandas is its high-performance, in-memory join
and merge operations.

Categories of Joins
 One-to-one joins
 Many-to-one joins
 Many-to-many joins

One – to – one joins


The simplest type of merge expression is the one-to-one join, which
is in many ways very similar to the column-wise concatenation.
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})

df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
df3 = pd.merge(df1, df2)  # one-to-one join on the shared 'employee' column

Many-to-one joins
Many-to-one joins are joins in which one of the two key columns contains
duplicate entries. For the many-to-one case, the resulting DataFrame will
preserve those duplicate entries as appropriate.
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
pd.merge(df3, df4)  # df3 = pd.merge(df1, df2)

  employee        group  hire_date supervisor
0      Bob   Accounting       2008      Carly
1     Jake  Engineering       2012      Guido
2     Lisa  Engineering       2004      Guido
3      Sue           HR       2014      Steve
The resulting DataFrame has an additional column with the “supervisor”
information, where the information is repeated in one or more locations as
required by the inputs.
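The one-to-one and many-to-one joins can be run as a single script. This is a sketch using the DataFrames defined in this handout; the explicit on and validate arguments are additions that make the expected join type checkable.

```python
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})

# One-to-one: each employee appears exactly once on both sides.
df3 = pd.merge(df1, df2, on='employee', validate='one_to_one')

# Many-to-one: several employees share a group, each group has one supervisor.
with_supervisor = pd.merge(df3, df4, on='group', validate='many_to_one')

print(df3)
print(with_supervisor)
```

If the data violated the stated relationship, validate would raise an error instead of silently duplicating rows.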

Many-to-many joins
Many-to-many joins are a bit confusing conceptually, but are nevertheless
well defined. If the key column in both the left and right array contains
duplicates, then the result is a many-to-many merge. This will be perhaps
most clear with a concrete example.
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
pd.merge(df1, df5)

  employee        group        skills
0      Bob   Accounting          math
1      Bob   Accounting  spreadsheets
2     Jake  Engineering        coding
3     Jake  Engineering         linux
4     Lisa  Engineering        coding
5     Lisa  Engineering         linux
6      Sue           HR  spreadsheets
7      Sue           HR  organization
Pivot Tables:
A pivot table is an operation that is commonly seen in spreadsheets and other
programs that operate on tabular data. The pivot table takes simple column-wise
data as input, and groups the entries into a two-dimensional table that provides a
multidimensional summarization of the data. The difference between pivot tables and
GroupBy can sometimes cause confusion; it helps to think of pivot tables as
essentially a multidimensional version of GroupBy aggregation. That is, you
split-apply-combine, but both the split and the combine happen across not a
one-dimensional index, but across a two-dimensional grid.

Pivot Table Creation


import pandas as pd
df = pd.read_csv('D:\[Link]')
df.pivot_table('preg', index='age', columns='Class').sample(10)
# the diabetes data set has a large number of rows, so sample() is used to show 10 of them

Class  tested_negative  tested_positive
age
63            5.500000              NaN
28            3.440000         2.000000
61            7.000000         4.000000
69            5.000000              NaN
45            7.285714         7.375000
62            6.500000         1.000000
53            2.000000         6.250000
68            8.000000              NaN
23            1.516129         1.857143
52           13.000000         3.428571
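The link between pivot tables and GroupBy can be checked on a small table held in memory. This is a sketch with made-up rows; only the column names preg, age and Class are taken from the example above.

```python
import pandas as pd

# Small made-up table; the column names mirror the diabetes example.
df = pd.DataFrame({
    'age':   [23, 23, 23, 28, 28, 45],
    'Class': ['tested_negative', 'tested_positive', 'tested_negative',
              'tested_negative', 'tested_positive', 'tested_positive'],
    'preg':  [1, 2, 3, 4, 2, 7],
})

# pivot_table averages 'preg' for every (age, Class) combination by default.
pivot = df.pivot_table('preg', index='age', columns='Class')

# The same table built by hand: group on two keys, aggregate, then unstack.
same = df.groupby(['age', 'Class'])['preg'].mean().unstack()

print(pivot)
```

Combinations that do not occur in the data, such as age 45 with tested_negative, appear as NaN.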
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
 Data Science from Scratch, author: Joel Grus, page nos.: 403
 Essential Math for Data Science, author: Thomas Nield, page nos.: 347
 Becoming a Data Head, author: Alex Gutman, page nos.: 272
Course Faculty

Verified by HoD

LECTURE HANDOUTS L37

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : V - Data Visualization Date of Lecture:


Topic of Lecture: Importing Matplotlib, Line plots
Introduction
The simplest of all plots is the visualization of a single function y = f(x). Here we will
take a first look at creating a simple plot of this type. The figure (an instance of the
class plt.Figure) can be thought of as a single container that contains all the objects
representing axes, graphics, text, and labels.
Prerequisite knowledge for Complete understanding and learning of Topic:
 OOP language
 Python
 Database
Detailed content of the Lecture:
Line Colors and Styles
 The first adjustment you might wish to make to a plot is to control the line
colors and styles.
 To adjust the color, you can use the color keyword, which accepts a string
argument representing virtually any imaginable color. The color can be
specified in a variety of ways
 If no color is specified, Matplotlib will automatically cycle through a set of
default colors for multiple lines

Different forms of color representation.


specify color by name - color='blue'
short color code (rgbcmyk) - color='g'
Grayscale between 0 and 1 - color='0.75'
Hex code (RRGGBB from 00 to FF) - color='#FFDD44'
RGB tuple, values 0 to 1 - color=(1.0,0.2,0.3)
all HTML color names supported - color='chartreuse'
 We can adjust the line style using the linestyle keyword.
Different line styles
linestyle='solid'
linestyle='dashed'
linestyle='dashdot'
linestyle='dotted'

Short assignment
linestyle='-' # solid
linestyle='--' # dashed
linestyle='-.' # dashdot
linestyle=':' # dotted

 linestyle and color codes can be combined into a single nonkeyword
argument to the plt.plot() function

plt.plot(x, x + 0, '-g') # solid green
plt.plot(x, x + 1, '--c') # dashed cyan

 The most basic way to adjust axis limits is to use the plt.xlim() and
plt.ylim() methods.
Example
plt.xlim(10, 0)  # limits given in reverse order flip the x-axis

import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 1000)
plt.plot(x, np.sin(x));
plt.plot(x, np.sin(x - 0), color='blue')         # specify color by name
plt.plot(x, np.sin(x - 1), color='g')            # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75')         # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44')      # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3))  # RGB tuple, values 0 to 1
plt.plot(x, np.sin(x - 5), color='chartreuse');  # all HTML color names supported

fig = plt.figure()  # new figure for the line-style examples
ax = plt.axes()
x = np.linspace(0, 10, 1000)
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');
# For short, you can use the following codes:
plt.plot(x, x + 4, linestyle='-')   # solid
plt.plot(x, x + 5, linestyle='--')  # dashed
plt.plot(x, x + 6, linestyle='-.')  # dashdot
plt.plot(x, x + 7, linestyle=':');  # dotted
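The color, line-style and axis-limit options can be combined in one short script. This is a sketch only: it uses the object-oriented interface (fig, ax) and the non-interactive Agg backend so that it runs without a display, both of which are assumptions rather than the handout's own setup.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: headless backend, no window needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()

# Keyword form: color and linestyle given separately.
(sine,) = ax.plot(x, np.sin(x), color='blue', linestyle='solid', label='sin(x)')
# Shorthand form: '--c' means dashed line, cyan color.
(cosine,) = ax.plot(x, np.cos(x), '--c', label='cos(x)')

ax.set_xlim(-1, 11)              # same effect as plt.xlim(-1, 11)
ax.set_ylim(-1.5, 1.5)           # same effect as plt.ylim(-1.5, 1.5)
ax.legend()
fig.canvas.draw()                # render the figure in memory
```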
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
 Data Science from Scratch, author: Joel Grus, page nos.: 403
 Essential Math for Data Science, author: Thomas Nield, page nos.: 347
 Becoming a Data Head, author: Alex Gutman, page nos.: 272

Course Faculty

Verified by HoD

LECTURE HANDOUTS L38

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : V - Data Visualization Date of Lecture:


Topic of Lecture: Scatter Plots, Visualization errors
Introduction :
Another commonly used plot type is the simple scatter plot, a close cousin of the line
plot. Instead of points being joined by line segments, here the points are represented
individually with a dot, circle, or other shape.
Prerequisite knowledge for Complete understanding and learning of Topic:
 Distributions
 Python
 Probability
Detailed content of the Lecture:

Another commonly used plot type is the simple scatter plot, a close cousin of
the line plot. Instead of points being joined by line segments, here the points
are represented individually with a dot, circle, or other shape.
Syntax
plt.plot(x, y, 'type of symbol', color);
Example
plt.plot(x, y, 'o', color='black');
 The third argument in the function call is a character that represents the
type of symbol used for the plotting. Just as you can specify options such
as '-' and '--' to control the line style, the marker style has its own set of
short string codes.
Example
 Various symbols used to specify the marker: ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']
 Shorthand assignment of line, symbol, and color is also allowed.
plt.plot(x, y, '-ok');

 Additional arguments in plt.plot()
We can specify some other parameters of the scatter plot to make it more
attractive. They are color, marker size, linewidth, marker face color,
marker edge color, marker edge width, etc.
Scatter Plots with plt.scatter
 A second, more powerful method of creating scatter plots is the plt.scatter
function, which can be used very similarly to the plt.plot function
plt.scatter(x, y, marker='o');
 The primary difference of plt.scatter from plt.plot is that it can be used to
create scatter plots where the properties of each individual point (size,
face color, edge color, etc.) can be individually controlled or mapped to
data.
 Notice that the color argument is automatically mapped to a color scale
(shown here by the colorbar() command), and the size argument s is given
in points squared (the marker area).
 cmap – the color map used in the scatter plot; different maps give different
color combinations.
Perceptually Uniform Sequential
['viridis', 'plasma', 'inferno', 'magma']
Sequential
['Greys', 'Purples', 'Blues', 'Greens', 'Oranges', 'Reds', 'YlOrBr', 'YlOrRd',
'OrRd', 'PuRd', 'RdPu', 'BuPu', 'GnBu', 'PuBu', 'YlGnBu', 'PuBuGn',
'BuGn', 'YlGn']
Sequential (2)
['binary', 'gist_yarg', 'gist_gray', 'gray', 'bone', 'pink', 'spring', 'summer',
'autumn', 'winter', 'cool', 'Wistia', 'hot', 'afmhot', 'gist_heat',
'copper']

Example
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 30)
y = np.sin(x)
plt.plot(x, y, 'o', color='black');

Scatter plot with edge color, face color, size, and width of marker.
(Scatter plot with line)
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 10, 20)
y = np.sin(x)
plt.plot(x, y, '-o', color='gray',
         markersize=15, linewidth=4,
         markerfacecolor='yellow',
         markeredgecolor='red', markeredgewidth=4)
plt.ylim(-1.5, 1.5);

Scatter plot with random colors, size and transparency:


import numpy as np
import matplotlib.pyplot as plt
rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)
plt.scatter(x, y, c=colors, s=sizes, alpha=0.3, cmap='viridis')
plt.colorbar()
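The per-point control offered by plt.scatter can be verified on the returned object. The following sketch repeats the random example using the object-oriented interface and the Agg backend; both choices are assumptions made so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)           # one value per point, mapped through the colormap
sizes = 1000 * rng.rand(100)     # one marker area per point, in points squared

fig, ax = plt.subplots()
points = ax.scatter(x, y, c=colors, s=sizes, alpha=0.3, cmap='viridis')
fig.colorbar(points, ax=ax, label='color value')
fig.canvas.draw()
```

Every point carries its own size and color value, which is what distinguishes this from a plt.plot call with a single marker style.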

For any scientific measurement, accurate accounting for errors is nearly as
important as, if not more important than, accurate reporting of the number itself.
For example, imagine that I am using some astrophysical observations to
estimate the Hubble Constant, the local measurement of the expansion rate of
the Universe.
In visualization of data and results, showing these errors effectively can make
a plot convey much more complete information.
Types of errors
 Basic Errorbars
 Continuous Errors

Basic Errorbars
A basic errorbar can be created with a single Matplotlib function call.
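import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')  # named 'seaborn-v0_8-whitegrid' in Matplotlib 3.6 and later
import numpy as np
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k');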

 Here the fmt is a format code controlling the appearance of lines and
points, and has the same syntax as the shorthand used in plt.plot()
 In addition to these basic options, the errorbar function has many
options to fine-tune the outputs. Using these additional options you
can easily customize the aesthetics of your errorbar plot.
plt.errorbar(x, y, yerr=dy, fmt='o', color='black', ecolor='lightgray',
             elinewidth=3, capsize=0);

Continuous Errors
 In some situations it is desirable to show errorbars on continuous
quantities. Though Matplotlib does not have a built-in convenience
routine for this type of application, it’s relatively easy to combine
primitives like plt.plot and plt.fill_between for a useful result.
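Both kinds of error display can be drawn next to each other. This is a sketch under stated assumptions: the sine curve stands in for a fitted model, the constant dy stands in for its uncertainty, and the Agg backend is used so that no display is needed.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(1)
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * rng.randn(50)      # noisy measurements

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Basic errorbars: one vertical bar of half-length dy per data point.
bars = ax1.errorbar(x, y, yerr=dy, fmt='o', color='black',
                    ecolor='lightgray', elinewidth=3, capsize=0)

# Continuous errors: shade the region between (model - dy) and (model + dy).
model = np.sin(x)                       # stand-in for a fitted curve
lower, upper = model - dy, model + dy
ax2.plot(x, model, '-', color='gray')
band = ax2.fill_between(x, lower, upper, color='gray', alpha=0.2)
fig.canvas.draw()
```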
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
 Data Science from Scratch, author: Joel Grus, page nos.: 403
 Essential Math for Data Science, author: Thomas Nield, page nos.: 347
 Becoming a Data Head, author: Alex Gutman, page nos.: 272

Course Faculty

Verified by HoD

LECTURE HANDOUTS L39

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : V - Data Visualization Date of Lecture:


Topic of Lecture: Density and Contour plots
Introduction :

Sometimes it is useful to display three-dimensional data in two dimensions using
contours or color-coded regions. This lecture covers the Matplotlib functions for
contour plots, filled contour plots, and image plots.
Prerequisite knowledge for Complete understanding and learning of Topic:
 Linear Algebra
 Number Theory
 Combinatorics
 Programming language such as Java and Python
Detailed content of the Lecture:

Density and Contour Plots


Sometimes it is useful to display three-dimensional data in two dimensions
using contours or color-coded regions. There are three Matplotlib
functions that can be helpful for this task:
 plt.contour for contour plots,
 plt.contourf for filled contour plots, and
 plt.imshow for showing images.

Visualizing a Three-Dimensional Function

A contour plot can be created with the plt.contour function. It takes three
arguments:
 a grid of x values,
 a grid of y values, and
 a grid of z values.
The x and y values represent positions on the plot, and the z values are
represented by the contour levels.
The way to prepare such data is to use the np.meshgrid function, which builds
two-dimensional grids from one-dimensional arrays:
Example
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

plt.contour(X, Y, Z, colors='black');

 Notice that by default when a single color is used, negative values are
represented by dashed lines, and positive values by solid lines.
 Alternatively, you can color-code the lines by specifying a colormap with the
cmap argument.
 We’ll also specify that we want more lines to be drawn—20 equally spaced
intervals within the data range.
 One potential issue with this plot is that it is a bit “splotchy.” That is, the
color steps are discrete rather than continuous, which is not always what
is desired.
 You could remedy this by setting the number of contours to a very high
number, but this results in a rather inefficient plot: Matplotlib must render
a new polygon for each step in the level.
 A better way to handle this is to use the plt.imshow() function, which
interprets a two-dimensional grid of data as an image.
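The three options discussed above (colored contour lines, filled contours and an image) can be drawn side by side. This is a sketch using the function f from the example; the three-panel layout and the Agg backend are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np

def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)         # both grids have shape (len(y), len(x))
Z = f(X, Y)

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4))
ax1.contour(X, Y, Z, 20, cmap='RdGy')              # color-coded contour lines
filled = ax2.contourf(X, Y, Z, 20, cmap='RdGy')    # filled contours, discrete steps
fig.colorbar(filled, ax=ax2)
image = ax3.imshow(Z, extent=[0, 5, 0, 5],         # image, continuous colors
                   origin='lower', cmap='RdGy')
fig.colorbar(image, ax=ax3)
fig.canvas.draw()
```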

There are a few potential gotchas with imshow().


 plt.imshow() doesn’t accept an x and y grid, so you must manually specify
the extent [xmin, xmax, ymin, ymax] of the image on the plot.
 plt.imshow() by default follows the standard image array definition where
the origin is in the upper left, not in the lower left as in most contour
plots. This must be changed when showing gridded data.

Finally, it can sometimes be useful to combine contour plots and image plots.
We’ll use a partially transparent background image (with transparency set via
the alpha parameter) and over-plot contours with labels on the contours
themselves (using the plt.clabel() function):
contours = plt.contour(X, Y, Z, 3, colors='black')
plt.clabel(contours, inline=True, fontsize=8)
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',
           cmap='RdGy', alpha=0.5)
plt.colorbar();

Example Program
import numpy as np
import matplotlib.pyplot as plt

def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)

x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
plt.imshow(Z, extent=[0, 5, 0, 5],  # extent must match the x and y ranges above
           origin='lower', cmap='RdGy')
plt.colorbar()
Video Content / Details of website for further learning:

 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
 Data Science from Scratch, author: Joel Grus, page nos.: 403
 Essential Math for Data Science, author: Thomas Nield, page nos.: 347
 Becoming a Data Head, author: Alex Gutman, page nos.: 272

Course Faculty

Verified by HoD

LECTURE HANDOUTS L40

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : V - Data Visualization Date of Lecture:


Topic of Lecture: Histogram, Legends
Introduction
A histogram is a simple plot for representing a large data set. It is a graph
showing a frequency distribution, that is, the number of observations within
each given interval.
Prerequisite knowledge for Complete understanding and learning of Topic:

 Linear Algebra
 Number Theory
 Combinatorics
 Programming language such as Java and Python
Detailed content of the Lecture:

Histograms
 A histogram is a simple plot for representing a large data set. It is a graph
showing a frequency distribution, that is, the number of observations within
each given interval.

Parameters
 plt.hist() is used to plot a histogram. The hist() function takes an array of
numbers, passed as an argument, and creates a histogram from it.

 bins - A histogram displays numerical data by grouping data into "bins" of
equal width. Each bin is plotted as a bar whose height corresponds to how
many data points are in that bin. Bins are also sometimes called "intervals",
"classes", or "buckets".
 density (called normed in older Matplotlib versions) - If True, the histogram
is normalized so that the total area of the bars is 1, which turns the counts
into a probability density.
 x - (n,) array or sequence of (n,) arrays. Input values; this takes either a
single array or a sequence of arrays which are not required to be of the
same length.
 histtype - {'bar', 'barstacked', 'step', 'stepfilled'}, optional. The type of
histogram to draw.
 'bar' is a traditional bar-type histogram. If multiple data are given the
bars are arranged side by side.
 'barstacked' is a bar-type histogram where multiple data are stacked
on top of each other.
 'step' generates a lineplot that is by default unfilled.
 'stepfilled' generates a lineplot that is by default filled.
Default is 'bar'
 align - {'left', 'mid', 'right'}, optional. Controls the horizontal placement
of the bars.
 'left': bars are centered on the left bin edges.
 'mid': bars are centered between the bin edges.
 'right': bars are centered on the right bin edges.
Default is 'mid'
 orientation - {'horizontal', 'vertical'}, optional
If 'horizontal', barh will be used for bar-type histograms and the bottom
kwarg will be the left edges.
 color - color or array_like of colors or None, optional
Color spec or sequence of color specs, one per dataset. Default (None) uses
the standard line color sequence.
Default is None
 label - str or None, optional. Default is None

Other parameter
 **kwargs - Patch properties. The ** syntax allows a variable number of
keyword arguments to be passed to a Python function.

Example
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')  # named 'seaborn-v0_8-white' in Matplotlib 3.6 and later
data = np.random.randn(1000)
plt.hist(data);

The hist() function has many options to tune both the calculation and the
display; here’s an example of a more customized histogram.
plt.hist(data, bins=30, alpha=0.5, histtype='stepfilled',
         color='steelblue', edgecolor='none');

The plt.hist docstring has more information on other customization options
available. The combination of histtype='stepfilled' along with some
transparency alpha is very useful when comparing histograms of several
distributions.
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)
kwargs = dict(histtype='stepfilled', alpha=0.3, bins=40)
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);
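The counting step behind a histogram can be inspected separately from the drawing. This is a sketch: np.histogram is not mentioned in the handout, and the seed and the Agg backend are assumptions chosen to make the run reproducible without a display.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(42)
data = rng.randn(1000)

# Counting only: np.histogram returns the bin counts and the bin edges.
counts, edges = np.histogram(data, bins=30)

# Plotting: density=True rescales the bars so that their total area is 1.
fig, ax = plt.subplots()
heights, bin_edges, patches = ax.hist(data, bins=30, density=True,
                                      histtype='stepfilled', alpha=0.5,
                                      color='steelblue')
area = np.sum(heights * np.diff(bin_edges))
fig.canvas.draw()
print(counts.sum(), area)
```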

Legends
Plot legends give meaning to a visualization, assigning labels to the various plot
elements. We previously saw how to create a simple legend; here we’ll take a
look at customizing the placement and aesthetics of the legend in Matplotlib.
ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')
ax.legend();
Customizing Plot Legends
Location and turn off the frame - We can specify the location of the legend
and turn off its frame with the loc and frameon parameters.

ax.legend(loc='upper left', frameon=False)
fig

Number of columns - We can use the ncol parameter to specify the number of
columns in the legend.

ax.legend(frameon=False, loc='lower center', ncol=2)
fig

Rounded box, shadow and frame transparency

We can use a rounded box (fancybox) or add a shadow, change the
transparency (alpha value) of the frame, or change the padding around the text.
ax.legend(fancybox=True, framealpha=1, shadow=True, borderpad=1)
fig
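The legend options above can be combined in a single call. The following is a sketch that sets up the figure and the two labeled curves itself; the Agg backend is an assumption so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')

# Placement (loc, ncol) and frame appearance (frameon, fancybox, ...) together.
legend = ax.legend(loc='lower center', ncol=2, frameon=True,
                   fancybox=True, framealpha=1, shadow=True, borderpad=1)
labels = [text.get_text() for text in legend.get_texts()]
fig.canvas.draw()
print(labels)
```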

Choosing Elements for the Legend


 The legend includes all labeled elements by default. We can change which
elements and labels appear in the legend by using the objects returned
by plot commands.
 The plt.plot() command is able to create multiple lines at once, and returns
a list of created line instances.
Passing any of these to plt.legend() will tell it which to identify, along with the
labels we’d like to specify
y = np.sin(x[:, np.newaxis] + np.pi * np.arange(0, 2, 0.5))
lines = plt.plot(x, y)
plt.legend(lines[:2], ['first', 'second']);
# Applying labels individually.
plt.plot(x, y[:, 0], label='first')
plt.plot(x, y[:, 1], label='second')
plt.plot(x, y[:, 2:])
plt.legend(framealpha=1, frameon=True);

Multiple legends
Through the standard legend interface, it is only possible to create a single
legend for the entire plot. If you try to create a second legend using
plt.legend() or ax.legend(), it will simply override the first one. We can work
around this by creating a new legend artist from scratch, and then using the
lower-level ax.add_artist() method to manually add the second artist to the plot.
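The workaround described above can be sketched as follows. The four line styles and the labels line A to line D are assumptions made for illustration; the Legend class is imported from matplotlib.legend.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: headless backend
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.legend import Legend

x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()
lines = []
for i, style in enumerate(['-', '--', '-.', ':']):
    lines += ax.plot(x, np.sin(x - i * np.pi / 2), style, color='black')

# First legend through the standard interface.
first = ax.legend(lines[:2], ['line A', 'line B'],
                  loc='upper right', frameon=False)

# Second legend built by hand and attached as an extra artist.
second = Legend(ax, lines[2:], ['line C', 'line D'],
                loc='lower right', frameon=False)
ax.add_artist(second)
fig.canvas.draw()
```

Calling ax.legend() a second time instead would have replaced the first legend.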

Example
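import matplotlib.pyplot as plt
plt.style.use('classic')
import numpy as np
x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')
ax.legend(loc='lower center', frameon=True,
          shadow=True, borderpad=1, fancybox=True)
fig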
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
 Data Science from Scratch, author: Joel Grus, page nos.: 403
 Essential Math for Data Science, author: Thomas Nield, page nos.: 347
 Becoming a Data Head, author: Alex Gutman, page nos.: 272

Course Faculty

Verified by HoD

LECTURE HANDOUTS L41

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : V- Data Visualization Date of Lecture:

Topic of Lecture: Colors, Subplots


Introduction :
Matplotlib has the concept of subplots: groups of smaller axes that can exist
together within a single figure. These subplots might be insets, grids of plots,
or other more complicated layouts.
We’ll explore four routines for creating subplots in Matplotlib.
Prerequisite knowledge for Complete understanding and learning of Topic:
 Linear Algebra
 Number Theory
 Combinations
 Programming language such as java and python
Detailed content of the Lecture:

Subplots
 Matplotlib has the concept of subplots: groups of smaller axes that can exist
together within a single figure.
 These subplots might be insets, grids of plots, or other more complicated
layouts.
 We’ll explore four routines for creating subplots in Matplotlib.
 plt.axes: Subplots by Hand
 plt.subplot: Simple Grids of Subplots
 plt.subplots: The Whole Grid in One Go
 plt.GridSpec: More Complicated Arrangements

plt.axes: Subplots by Hand


 The most basic method of creating an axes is to use the plt.axes function.
As we’ve seen previously, by default this creates a standard axes object
that fills the entire figure.
 plt.axes also takes an optional argument that is a list of four numbers in
the figure coordinate system.
 These numbers represent [left, bottom, width, height] in the figure
coordinate system, which ranges from 0 at the bottom left of the figure to 1
at the top right of the figure.
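
A minimal sketch of an inset axes placed by hand. Matplotlib reads the four numbers as [left, bottom, width, height] in figure coordinates; the particular values 0.65 and 0.2 are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = plt.axes()  # standard axes filling the figure

# Inset axes: lower-left corner at 65% of the figure width and height,
# spanning 20% of the figure in each direction
ax2 = plt.axes([0.65, 0.65, 0.2, 0.2])
```

The inset lands in the top-right corner of the figure, on top of the main axes.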

plt.subplot: Simple Grids of Subplots

 The plt.subplot() command creates a single subplot within a grid. It takes
three integer arguments: the number of rows, the number of columns, and the
index of the plot to be created in this scheme, which runs from the upper
left to the bottom right.
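
A minimal sketch of the three-argument form, assuming the command described here is plt.subplot. Each panel is labeled with its (rows, columns, index) triple so the numbering order is visible; the labels are an illustrative assumption.

```python
import matplotlib.pyplot as plt

fig = plt.figure()
axes = []
for i in range(1, 7):
    # 2 rows, 3 columns, index i (counted from 1, row by row)
    ax = plt.subplot(2, 3, i)
    ax.text(0.5, 0.5, str((2, 3, i)), fontsize=18, ha='center')
    axes.append(ax)
```

Index 1 is the upper-left panel and index 6 is the lower-right panel.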

plt.subplots: The Whole Grid in One Go


 The approach just described can become quite tedious when you’re
creating a large grid of subplots, especially if you’d like to hide the
x- and y-axis labels on the inner plots.
 For this purpose, plt.subplots() is the easier tool to use (note the s at
the end of subplots).
 Rather than creating a single subplot, this function creates a full grid of
subplots in a single line, returning them in a NumPy array.
 The arguments are the number of rows and number of columns, along with
optional keywords sharex and sharey, which allow you to specify the
relationships between different axes.

 Here we’ll create a 2×3 grid of subplots, where all axes in the same row
share their y-axis scale, and all axes in the same column share their x-axis
scale:
fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')
Note that by specifying sharex and sharey, we’ve automatically removed inner
labels on the grid to make the plot cleaner.
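
The returned grid can be indexed like any two-dimensional NumPy array. The sketch below labels each panel with its [row, column] index; the text labels are an assumption added for illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')

# ax is a 2x3 NumPy array of axes, indexed [row, col] starting from zero
for i in range(2):
    for j in range(3):
        ax[i, j].text(0.5, 0.5, str((i, j)), fontsize=18, ha='center')
```

Note that this array uses zero-based [row, col] indexing, unlike the one-based index used by the single-subplot command.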

plt.GridSpec: More Complicated Arrangements


To go beyond a regular grid to subplots that span multiple rows and columns,
plt.GridSpec() is the best tool. The plt.GridSpec() object does not create a
plot by itself; it is simply a convenient interface that is recognized by the
plt.subplot() command.
For example, a gridspec for a grid of two rows and three columns with some
specified width and height space looks like this:
grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3)
From this we can specify subplot locations and extents using the familiar
Python slicing syntax:
plt.subplot(grid[0, 0])
plt.subplot(grid[0, 1:])
plt.subplot(grid[1, :2])
plt.subplot(grid[1, 2]);

Text and Annotation


 The most basic types of annotations are axes labels and titles; here we
look at some further ways to annotate a visualization.
 Text annotation can be done manually with the plt.text/ax.text command,
which will place text at a particular x/y value.
 The ax.text method takes an x position, a y position, a string, and then
optional keywords specifying the color, size, style, alignment, and other
properties of the text. For example, ha='right' and ha='center' set the
horizontal alignment (ha is short for horizontal alignment).

Transforms and Text Position


 We anchored our text annotations to data locations. Sometimes it’s
preferable to anchor the text to a position on the axes or figure,
independent of the data. In Matplotlib, we do this by modifying the
transform.
 Any graphics display framework needs some scheme for translating between
coordinate systems.
 Mathematically, such coordinate transformations are relatively
straightforward, and Matplotlib has a well-developed set of tools that it uses
internally to perform them (the tools can be explored in the
matplotlib.transforms submodule).
 There are three predefined transforms that can be useful in this situation.

o ax.transData - Transform associated with data coordinates
o ax.transAxes - Transform associated with the axes (in units of axes
dimensions)
o fig.transFigure - Transform associated with the figure (in units of figure
dimensions)

Example
import matplotlib.pyplot as plt
import matplotlib as mpl
# named 'seaborn-v0_8-whitegrid' in Matplotlib 3.6 and later
plt.style.use('seaborn-whitegrid')
import numpy as np
import pandas as pd
fig, ax = plt.subplots(facecolor='lightgray')
ax.axis([0, 10, 0, 10])
# transform=ax.transData is the default, but we'll specify it anyway
ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)
ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)

Video Content / Details of website for further learning:

 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:

 Data Science from Scratch, author: Joel Grus, page nos: 403
 Essential Math for Data Science, author: Thomas Nield, page nos: 347
 Becoming a Data Head, author: Alex Gutman, page nos: 272
Course
Faculty

Verified by HoD

LECTURE HANDOUTS L42

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : V- Data Visualization Date of Lecture:

Topic of Lecture: Text and Annotations, Customization


Introduction :
Matplotlib has the concept of subplots: groups of smaller axes that can exist
together within a single figure. These subplots might be insets, grids of plots,
or other more complicated layouts.
We’ll explore four routines for creating subplots in Matplotlib.
Prerequisite knowledge for Complete understanding and learning of Topic:
 Linear Algebra
 Number Theory
 Combinations
 Programming language such as java and python
Detailed content of the Lecture:

Text and Annotation


 The most basic types of annotations are axes labels and titles; here we
look at some further ways to annotate a visualization.
 Text annotation can be done manually with the plt.text/ax.text command,
which will place text at a particular x/y value.
 The ax.text method takes an x position, a y position, a string, and then
optional keywords specifying the color, size, style, alignment, and other
properties of the text. For example, ha='right' and ha='center' set the
horizontal alignment (ha is short for horizontal alignment).
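
A short sketch of placing text at data positions with different horizontal alignments. The sine curve, the label strings, and the chosen positions are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 200)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x))

# Centered label just above the first peak of the curve
peak = ax.text(np.pi / 2, 1.05, 'local maximum',
               ha='center', color='gray', size=10)

# Right-aligned label: the string ends at x = 10
end = ax.text(10, 0, 'end of data', ha='right', style='italic')
```

With ha='center' the string is centered on the given x value, while ha='right' makes the string finish there.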

Transforms and Text Position


 We anchored our text annotations to data locations. Sometimes it’s
preferable to anchor the text to a position on the axes or figure,
independent of the data. In Matplotlib, we do this by modifying the
transform.
 Any graphics display framework needs some scheme for translating between
coordinate systems.
 Mathematically, such coordinate transformations are relatively
straightforward, and Matplotlib has a well-developed set of tools that it
uses internally to perform them (the tools can be explored in the
matplotlib.transforms submodule).
 There are three predefined transforms that can be useful in this situation.

o ax.transData - Transform associated with data coordinates
o ax.transAxes - Transform associated with the axes (in units of axes
dimensions)
o fig.transFigure - Transform associated with the figure (in units of figure
dimensions)

Example
import matplotlib.pyplot as plt
import matplotlib as mpl
# named 'seaborn-v0_8-whitegrid' in Matplotlib 3.6 and later
plt.style.use('seaborn-whitegrid')
import numpy as np
import pandas as pd
fig, ax = plt.subplots(facecolor='lightgray')
ax.axis([0, 10, 0, 10])
# transform=ax.transData is the default, but we'll specify it anyway
ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)
ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)
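
The practical difference between data and axes coordinates shows up when the axis limits change. The sketch below converts one point under each transform to display (pixel) coordinates before and after changing the limits; the helper name display_xy and the new limits are assumptions for illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.axis([0, 10, 0, 10])

def display_xy(transform, point):
    """Return the display (pixel) position of `point` under `transform`."""
    return tuple(transform.transform(point))

data_before = display_xy(ax.transData, (1, 5))
axes_before = display_xy(ax.transAxes, (0.5, 0.1))

# Changing the axis limits moves data-anchored positions only
ax.set_xlim(0, 2)
ax.set_ylim(-6, 6)

data_after = display_xy(ax.transData, (1, 5))
axes_after = display_xy(ax.transAxes, (0.5, 0.1))
```

The axes-anchored point keeps its pixel position, while the data-anchored point moves with the new limits, which is why text placed with the axes transform stays put when the view changes.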

Video Content / Details of website for further learning:

 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:

 Data Science from Scratch, author: Joel Grus, page nos: 403
 Essential Math for Data Science, author: Thomas Nield, page nos: 347
 Becoming a Data Head, author: Alex Gutman, page nos: 272

Course
Faculty

Verified by HoD
LECTURE HANDOUTS L43

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : V - Data Visualization Date of Lecture:


Topic of Lecture: Three-Dimensional Plotting in Matplotlib
Introduction :
We enable three-dimensional plots by importing the mplot3d toolkit, included with
the main Matplotlib installation.

Prerequisite knowledge for Complete understanding and learning of Topic:


 Linear Algebra
 Number Theory
 Combinations
 Programming language such as java and python
Detailed content of the Lecture:

Three-Dimensional Points and Lines:


The most basic three-dimensional plot is a line or scatter plot created from sets of
(x, y, z) triples.
In analogy with the more common two-dimensional plots discussed earlier, we can
create these using the ax.plot3D and ax.scatter3D functions.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
ax = plt.axes(projection='3d')
# Data for a three-dimensional line
zline = np.linspace(0, 15, 1000)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')
# Data for three-dimensional scattered points
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
# Final plotting call missing from the original listing; the color
# arguments are an assumed choice
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens');

Three-Dimensional Contour Plots


 mplot3d contains tools to create three-dimensional relief plots using the
same inputs.
 Like two-dimensional ax.contour plots, ax.contour3D requires all the input
data to be in the form of two-dimensional regular grids, with the Z data
evaluated at each point.
 Here we’ll show a three-dimensional contour diagram of a three dimensional
sinusoidal function
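
A sketch of such a contour diagram. The radially symmetric function f, the grid range, and the number of contour levels are assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d  # registers the 3D projection on older Matplotlib

def f(x, y):
    # Assumed example: a radially symmetric sinusoid
    return np.sin(np.sqrt(x ** 2 + y ** 2))

x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)   # two-dimensional regular grids
Z = f(X, Y)                # Z evaluated at each grid point

fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
```

np.meshgrid builds the regular X and Y grids from the one-dimensional coordinate arrays, so Z has the same shape as both.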

Wireframes and Surface Plots


 Two other types of three-dimensional plots that work on gridded data are
wireframes and surface plots.
 These take a grid of values and project it onto the specified
three-dimensional surface, and can make the resulting three-dimensional
forms quite easy to visualize.

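import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
# X, Y and Z were not defined in the original listing; this grid and
# function are an assumed example
x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X ** 2 + Y ** 2))
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_wireframe(X, Y, Z, color='black')
ax.set_title('wireframe');
plt.show()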
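
A surface plot works on the same kind of gridded data but fills each grid cell with a colored polygon. This is a minimal sketch; the function, the grid, and the viridis colormap are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d  # registers the 3D projection on older Matplotlib

# Assumed example grid and function
x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X ** 2 + Y ** 2))

fig = plt.figure()
ax = plt.axes(projection='3d')
# rstride/cstride of 1 use every row and column of the grid
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
                       cmap='viridis', edgecolor='none')
ax.set_title('surface')
```

Adding a colormap to the filled polygons helps the eye follow the topology of the surface.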
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
 Data Science from Scratch, author: Joel Grus, page nos: 403
 Essential Math for Data Science, author: Thomas Nield, page nos: 347

Course Faculty

Verified by HoD

LECTURE HANDOUTS
L44

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : V - Data Visualization Date of Lecture:


Topic of Lecture: Geographic Data with Basemap
Introduction :
One common type of visualization in data science is that of geographic data.
Matplotlib’s main tool for this type of visualization is the Basemap toolkit,
which is one of several Matplotlib toolkits that live under the mpl_toolkits
namespace. Basemap is a useful tool for Python users to have in their virtual
toolbelts. Basemap must be installed separately from Matplotlib. Once you have
the Basemap toolkit installed and imported, geographic plots also require the
PIL package in Python 2, or the pillow package in Python 3.
Prerequisite knowledge for Complete understanding and learning of Topic:
 Linear Algebra
 Number Theory
 Combinations
 Programming language such as java and python
Detailed content of the Lecture:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);

 The globe drawn this way is a fully functioning Matplotlib axes that
understands spherical coordinates and allows us to easily over-plot data on
the map.
 We’ll use an etopo image (which shows topographical features both on
land and under the ocean) as the map background.
 Program to display a particular area of the map with latitude and longitude
lines:
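import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
from itertools import chain
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None, width=8E6, height=8E6,
            lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)
def draw_map(m, scale=0.2):
    # The body is cut off in the original listing; this assumed version
    # draws a shaded-relief image with latitude and longitude lines
    m.shadedrelief(scale=scale)
    m.drawparallels(np.linspace(-90, 90, 13))
    m.drawmeridians(np.linspace(-180, 180, 13))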

Map Projections:
The Basemap package implements several dozen such projections, all referenced
by a short format code. Here we’ll briefly demonstrate some of the more common
ones.
 Cylindrical projections
 Pseudo-cylindrical projections
 Perspective projections
 Conic projections

Cylindrical projection
 The simplest of map projections are cylindrical projections, in which lines
of constant latitude and longitude are mapped to horizontal and vertical
lines, respectively.
 This type of mapping represents equatorial regions quite well, but
results in extreme distortions near the poles.
 The spacing of latitude lines varies between different cylindrical
projections, leading to different conservation properties, and different
distortion near the poles.
 Other cylindrical projections are the Mercator (projection='merc')
and the cylindrical equal-area (projection='cea') projections.
 The additional arguments to Basemap for this view specify the latitude
(lat) and longitude (lon) of the lower-left corner (llcrnr) and upper-right
corner (urcrnr) for the desired map, in units of degrees.
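import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
# The plotting call is missing from the original listing; the corner
# values below (the whole globe) are an assumed example
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
            llcrnrlat=-90, urcrnrlat=90, llcrnrlon=-180, urcrnrlon=180)
draw_map(m)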

Pseudo-cylindrical projections
 Pseudo-cylindrical projections relax the requirement that meridians
(lines of constant longitude) remain vertical; this can give better
properties near the poles of the projection.
 The Mollweide projection (projection='moll') is one common example of
this, in which all meridians are elliptical arcs
 It is constructed so as to preserve area across the map: though there are
distortions near the poles, the area of small patches reflects the true area.
 Other pseudo-cylindrical projections are the sinusoidal (projection='sinu')
and Robinson (projection='robin') projections.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='moll', resolution=None, lat_0=0, lon_0=0)
draw_map(m)

Plotting Data on Maps


 One of the most useful features of the Basemap toolkit is the ability to
over-plot a variety of data onto a map background.
 There are many map-specific functions available as methods of the Basemap
instance. Some of these map-specific methods are:
contour()/contourf() - Draw contour lines or filled contours
imshow() - Draw an image
pcolor()/pcolormesh() - Draw a pseudocolor plot for irregular/regular meshes
plot() - Draw lines and/or markers
scatter() - Draw points with markers
quiver() - Draw vectors
barbs() - Draw wind barbs
drawgreatcircle() - Draw a great circle
Video Content / Details of website for further learning:
 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:
 Data Science from Scratch, author: Joel Grus, page nos: 403
 Essential Math for Data Science, author: Thomas Nield, page nos: 347
 Becoming a Data Head, author: Alex Gutman, page nos: 272

Course Faculty

Verified by HoD

LECTURE HANDOUTS L45

CSE II/III

Course Name with Code : CS3352 & Foundations of Data Science

Course Faculty : [Link]

Unit : V - Data Visualization Date of Lecture:

Topic of Lecture: Visualization with Seaborn


Introduction :

The main idea of Seaborn is that it provides high-level commands to create a
variety of plot types useful for statistical data exploration, and even some
statistical model fitting.
Prerequisite knowledge for Complete understanding and learning of Topic:

 Linear Algebra
 Number Theory
 Combinatorics
 Programming language such as java and python
Detailed content of the Lecture:

 Often in statistical data visualization, all you want is to plot histograms
and joint distributions of variables. We have seen that this is relatively
straightforward in Matplotlib.
 Rather than a histogram, we can get a smooth estimate of the distribution
using a kernel density estimation, which Seaborn does with sns.kdeplot:
data = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000)
data = pd.DataFrame(data, columns=['x', 'y'])
for col in 'xy':
    # Loop body missing from the original listing; assumed here to draw
    # one filled KDE curve per column
    sns.kdeplot(data[col], fill=True)

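 Histograms and KDE can be combined using distplot (deprecated since
Seaborn 0.11, where sns.histplot with kde=True plays the same role):
sns.distplot(data['x'])
sns.distplot(data['y']);

 If we pass the full two-dimensional dataset to kdeplot, we will get a
two-dimensional visualization of the data.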

We can see the joint distribution and the marginal distributions together using
sns.jointplot.

Pair plots
When you generalize joint plots to datasets of larger dimensions, you end up with
pair plots. This is very useful for exploring correlations between multidimensional
data, when you’d like to plot all pairs of values against each other.
We’ll demo this with the Iris dataset, which lists measurements of petals and
sepals of three iris species:
import seaborn as sns
iris = sns.load_dataset("iris")
# 'height' replaces the 'size' keyword, which was renamed in Seaborn 0.9
sns.pairplot(iris, hue='species', height=2.5);
Factor plots
Factor plots can be useful for this kind of visualization as well. This allows you
to view the distribution of a parameter within bins defined by any other
parameter.
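
A hedged sketch of this kind of plot. In Seaborn 0.9 and later the factor plot function is named sns.catplot; the DataFrame below is hypothetical data invented for illustration, as are its column names.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
# Hypothetical data: bill totals grouped by day and by time of day
df = pd.DataFrame({
    'day': rng.choice(['Thur', 'Fri', 'Sat', 'Sun'], size=200),
    'time': rng.choice(['Lunch', 'Dinner'], size=200),
    'total_bill': rng.gamma(shape=4.0, scale=5.0, size=200),
})

# Box plot of total_bill within bins defined by day, split by time
g = sns.catplot(data=df, x='day', y='total_bill', hue='time', kind='box')
g.set_axis_labels('Day', 'Total Bill')
```

Each box summarizes the distribution of total_bill inside one (day, time) bin, which is the "parameter within bins defined by another parameter" view described above.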
Joint distributions
Similar to the pair plot we saw earlier, we can use sns.jointplot to show the
joint distribution between different datasets, along with the associated marginal
distributions.


Video Content / Details of website for further learning:


 [Link]
 [Link]
 [Link]
 [Link]
Important Books/Journals for further learning including the page nos.:

 Data Science from Scratch, author: Joel Grus, page nos: 403
 Essential Math for Data Science, author: Thomas Nield, page nos: 347
 Becoming a Data Head, author: Alex Gutman, page nos: 272

Course Faculty

Verified by HoD
Department of Computer Science and Engineering
Question Bank – Academic Year (2024-25)
Course code & Course Name: CS3352 & Foundations of Data science
Name of the Faculty : [Link]
Year/Sem/Sec : II/III/A

UNIT-I

PART-A (2 Marks)
1. Define Data Science.

2. What is big data and list out?

3. Identify the components of data science.

4. Enumerate the categories of data used in data science.

5. What is project charter?

6. Define data warehouse, data mart and data lake.

7. What is meant by data cleaning?

8. What is confusion matrix?

9. Differentiate data science and data mining.

10. Define EDA.

PART-B (13 marks)


1. Give an overview of the data science process.

2. Explain about any three application domains of data science process.

3. Describe the categories of data for data mining.

4. Explain the different stages of data preparation phase.

5. Discuss the methods used for identifying outliers in the data.

PART- C (15 Marks)

1. Describe the steps in KDD in Data mining.

2. Discuss in brief about the tools for data science model building.

3. Describe the approaches for data Exploration.

UNIT-II
PART-A (2 marks)
1. Define data. What are the types of data?
2. What is qualitative data? Give example.

3. Define approximate number.

4. What is percentile rank?

5. State the differences between a histogram and bar graph.

6. What is variance?

7. What is z score?

8. How will you convert a z score to original score?

9. Compare discrete and continuous variables.

10. What are the types of frequency distribution?

PART- B (13 marks)

1. Elaborate the different ways to describe or represent data using tables with
suitable example.

2. Explain the various ways by which data can be represented or described using
graphs, with suitable examples and diagrams.

3. Explain the different types of frequency distribution with suitable example and
diagrams.

4. Using the computation formula for the sum of squares, calculate the population
standard deviation and the sample standard deviation for the scores:
1, 3, 7, 2, 0, 4, 3, 7, 7, 9, 2.

5. Compute the mean, median and mode for the following data sets:

* 45, 55, 60, 60, 63, 63, 63, 63, 65, 65, 70

* 26, 26, 28, 27, 26, 27, 26, 26

PART- C (15 Marks)

1. Using the computation formula for the sum of squares, calculate the population
standard deviation and the sample standard deviation for the scores:

1, 3, 7, 2, 0, 4, 3, 7

10, 8, 5, 0, 1, 7, 9, 2, 1

2. Consider the test scores approximating a normal curve with a mean of 500 and
a standard deviation of 100. Sketch a normal curve and shade in the target
areas described by the following:

More than 570

Less than 515

Between 520 and 540

Plan the solutions for the target areas. Convert to z scores and find the
proportions that correspond to the target areas.

3. Construct the histogram and convert it to a frequency polygon for the
following:

139, 139, 139, 145, 1475, 150, 145, 136, 150, 152, 144, 138, 138, 150, 149,
133, 134, 152, 155, 151

UNIT-III

PART-A (2 marks)
1. What is correlation?

2. What are the types of correlation and need?

3. What is a scatterplot?

4. What is a linear relationship and types?

5. List the types of nonlinear relationship

6. Define curvilinear relationship.

7. What is outlier?

8. Define regression

9. Compare correlation and regression

10. What is the Interpretation of (r)?

PART- B (13 marks)

1. Explain in detail about correlation and types.

2. What are scatterplots? Explain with examples.

3. Highlight the significance of the correlation coefficient r and compare the
various correlation coefficients.

4. Find the standard error of the estimate of the mean weight of high school
football players using the given data on the players' weights:

Player Number : 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Weights in Pounds : 150, 203, 176, 190, 168, 193, 189, 178, 197, 172

5. Elaborate on multiple regression equations

6. Explain in detail the significance of r. Give a detailed interpretation of r².


PART- C (15 Marks)

1. Calculate and analyse the correlation coefficient between the number of study
hours and sleeping hours of different students:

Study hours: 2, 4, 6, 8, 10

Sleeping hours: 10, 9, 8, 7, 6

2. Find the value of correlation coefficient from:

Subject : 1 ,2 ,3 ,4 ,5 ,6

Age: 43,21,25,42,57,59

Glucose level: 99,65,79,75,87,81

3. Illustrate (i) scatterplots (ii) least squares regression equation

UNIT-IV

PART-A (2 marks)
1. What is numpy? List its use.

2. Write python code to create 1D, 2D and 3D numpy arrays.

3. Compare python list and array

4. Enumerate the attributes of a numpy array.

5. What is indexing and negative indexing in tuples?

6. Define fancy indexing.

7. What is pivot table?

8. List the limitations of broad casting.

9. What are universal functions?

10. List is mutable-justify with suitable examples.

PART- B (13 marks)

1. List the prime numbers between 0 and 100 by using a Boolean array.

2. Extract from the array np.array([3, 4, 6, 10, 24, 89, 45, 43, 46, 99, 100])
with Boolean masking all the numbers:

* Which are divisible by 3

* Which are divisible by 5

* Which are divisible by 3 and 5

* Which are divisible by 3 and set them to 42


3. Elaborate on indexing and slicing operations of Numpy arrays

4. Discuss the aggregation operations of Numpy arrays with examples.

5. Explain comparison and masking operations.

6. Assess the benefits of fancy indexing.

PART- C (15 Marks)

1. Discuss the approaches to combine datasets and identify the challenges.

2. Explain the benefits of multiple indexing.

3. Elaborate various methods of handling the missing data in pandas.

UNIT-V

PART-A (2 marks)
1. Write the significance of data visualization

2. Write python code to display a simple plot using matplotlib

3. List the interface supported by matplotlib

4. How can you set different colors for line plot?

5. What is scatter plot?

6. What is histogram?

7. Write the features of seaborn module.

8. What are pair plots?

9. Define density plot.

10. Comment on text transforms.

PART- B (13 marks)

1. Write python program to plot line chart by assuming your own data and explain.

2. Demonstrate the usage of histogram for data exploration and explain its
attributes.

3. Write python program to visualize data set using scatter plots and explain.

4. Write a Python program to visualize the dataset using a scatter plot.

5. Explain in detail the functions of the mpl_toolkits Basemap toolkit for
geographic data visualization.

PART- C (15 Marks)


1. Explain in detail about visualization with Seaborn.
2. Elaborate the error visualization methods in pyplot.
3. Discuss in detail about 3D plotting functions of matplotlib module.
