MUST KNOW CONCEPTS
MKC
CSE 2024-25
Course Code & Course Name : CS3352 & Foundations of Data
Science
Year/Sem/Sec : II/III/A
S.No | Term | Notation (Symbol) | Concept/Definition/Meaning/Units/Equation/Expression | Units
UNIT I- INTRODUCTION
Data science is the area of study
which involves extracting insights
1 Data Science from vast amounts of data using
various scientific methods,
algorithms, and processes.
2 Machine learning
Machine learning trains a software model on data so that it can perform tasks the way a human expert would.
The upstream process: Acquiring,
cleaning, integrating
3 Facets of data
The downstream process: analysis,
modelling and prediction.
It is a continuous flow of data from
a source to destination to be
4 Data streaming
processed and analysed in near real
time.
Setting the research goal, retrieving
Data science data, data preparation, data
5
process exploration, data modelling,
presentation and automation.
Any data that has been received,
stored or changed in such a manner
6 Noisy data that it cannot be read or used by
the program that originally created
it can be described as noisy.
7 Cleansing data
Data cleansing is a subprocess of the data preparation process that focuses on removing errors in the data.
8 Outliers
An outlier is a data object that deviates significantly from the rest of the data objects and behaves in a different manner.
Data transformation is the process
of converting data from one format
Data
9 to another. It is converting raw data
transformation
into clean and usable form,
removing duplicates.
A dummy variable is a numerical
Dummy variable used in regression analysis
10
variables to represent subgroups of the
sample in the study.
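As a minimal sketch of dummy coding, the snippet below uses pandas `get_dummies` to turn a categorical column into 0/1 indicator columns. The column name `department` and its values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample: one categorical column describing a subgroup
df = pd.DataFrame({"department": ["CSE", "ECE", "CSE", "MECH"]})

# One 0/1 indicator column per category (dtype=int keeps them numeric)
dummies = pd.get_dummies(df["department"], dtype=int)
print(dummies)
```

Each resulting column can then enter a regression model as an ordinary numerical predictor.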
11 Euclidean distance
Notation: $D(x, y) = \sqrt{\sum_{i=1}^{n} (y_i - x_i)^2}$
It is the distance between two points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem.
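The Euclidean distance can be sketched in a few lines of Python. The helper name `euclidean_distance` is an assumption made for this illustration; it takes two points given as equal-length sequences of coordinates.

```python
import math

def euclidean_distance(x, y):
    """Return the Euclidean distance between two points of equal dimension."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((yi - xi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```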
Exploratory data analysis (EDA) is
used by data scientists to analyze
and investigate data sets and
summarize their main
characteristics, often employing
data visualization methods. It helps
12 EDA
determine how best to manipulate
data sources to get the answers you
need, making it easier for data
scientists to discover patterns, spot
anomalies, test a hypothesis, or
check assumptions.
Tools for model R and PL/R, Octave, WEKA, Python,
13
building SQL, MADLib
A data warehouse centralizes and
consolidates large amounts of data
from multiple sources. Its analytical
capabilities allow organizations to
Data derive valuable business insights
14
warehousing from their data to improve decision-
making. Over time, it builds a
historical record that can be
invaluable to data scientists and
business analysts.
Data mining is the process of
sorting through large data sets to
identify patterns and relationships
that can help solve business
15 Data mining problems through data analysis.
Data mining techniques and tools
enable enterprises to predict future
trends and make more-informed
business decisions.
Market and stock analysis, fraud
Data mining
16 detection, risk management,
applications
analysing customer life value.
17 Data mining tools
RapidMiner, WEKA, KNIME, Apache Mahout, Oracle Data Mining.
Selection,pre-processing,
Data mining transformation, data mining,
18
steps interpretation, knowledge
extraction.
19 Data science components
Statistics, domain expertise, data engineering, visualization, advanced computing, mathematics, machine learning.
The definition of big data is data
that contains greater variety,
20 Big data
arriving in increasing volumes and
with more velocity.
Data engineering is the process of
designing and building systems that
let people collect and analyse raw
Data data from multiple sources and
21
Engineering formats. These systems empower
people to find practical applications
of the data, which businesses can
use to thrive
22 Confusion Matrix
A confusion matrix describes the performance of a classification algorithm. It is a table that visualizes and summarizes the predicted classes against the actual classes.
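As an illustration, a 2 x 2 confusion matrix for a binary classifier can be built by counting (actual, predicted) pairs. The label lists below are hypothetical, and the row/column layout (rows = actual, columns = predicted) is one common convention rather than the only one.

```python
from collections import Counter

# Hypothetical actual and predicted labels for a binary classifier
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Count every (actual, predicted) pair
counts = Counter(zip(actual, predicted))
tp, fn = counts[(1, 1)], counts[(1, 0)]
fp, tn = counts[(0, 1)], counts[(0, 0)]

# Rows are actual classes, columns are predicted classes
matrix = [[tn, fp],
          [fn, tp]]
accuracy = (tp + tn) / len(actual)
print(matrix, accuracy)
```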
It is a document that lays out the
project vision, scope, objectives,
23 Project charter
project team and their
responsibilities.
24 Data lake
A data lake stores an organization's raw and processed data at both large and small scales.
25 Data mart
A data mart supplies the subject-oriented data necessary to support a specific business unit.
UNIT II: DESCRIBING DATA
Data is a collection of discrete or
continuous values that
convey information, describing
26 Data the quantity, quality, fact, statistics,
other basic units of meaning, or
simply sequences of symbols that
may be further interpreted formally
Nominal data is data that can be
labelled or classified into mutually
27 Nominal data exclusive categories within a
variable. These categories cannot
be ordered in a meaningful way.
28 Ordinal data Ordinal data classifies data while
introducing an order, or ranking. For
instance, measuring economic
status using the hierarchy:
‘wealthy’, ‘middle income’ or ‘poor.’
However, there is no clearly defined
interval between these categories.
29 Variable
A variable is a value that can change, depending on conditions or on information passed to the program.
Frequency distributions are visual
displays that organise and present
Frequency
30 frequency counts so that the
distribution
information can be interpreted
more easily.
31 Outliers
In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement, or it may be an indication of novel data.
32 Cumulative distribution
Notation: F(x) = P[X ≤ x]
The cumulative distribution function of a random variable X gives the probability that X is less than or equal to x. It is usually denoted F(x).
A graph can be defined as a
pictorial representation or a
diagram that represents data or
33 Graph values in an organized manner. The
points on the graph often represent
the relationship between two or
more things.
34 Histogram
A histogram is a graphical representation of the distribution of data. It is drawn as a set of adjacent rectangles, where each bar represents the frequency of values falling within one interval (bin).
Collection of data with predictive
35 Table
model
Average is a numeric value in
Mathematics that is used to
represent a large amount of data. It
36 Average
uses a single number to represent
all the other numbers that you
might find in a large data set.
37 Variability
Variability is the extent to which data points in a statistical distribution or data set diverge from the average value and from each other.
38 Standard deviation
Standard deviation is considered to be a powerful tool to measure dispersion. Here, dispersion means the amount by which items differ from a central value, in this case the arithmetic mean.
39 Mode
Mode is one of the measures of central tendency. It is the value that occurs most frequently in a set of data.
40 Median
Median is defined as the middle value in a given set of numbers or data when the values are arranged in order.
41 Mean
The mean is the sum of all values in a data set divided by the number of values. It is a measure of central tendency.
The range of a data set is the
difference between the greatest
42 Range
value and lowest value within a
collection of numbers.
43 Interquartile range
There are three quartiles. The first quartile, Q1, is known as the lower quartile; the second quartile is denoted by Q2; and the third quartile, Q3, is known as the upper quartile. The interquartile range is the difference Q3 - Q1.
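The quartiles and the interquartile range (Q3 minus Q1) can be computed with NumPy, as in the sketch below. The sample values are hypothetical, and NumPy's default percentile method (linear interpolation) is assumed; other quartile conventions can give slightly different results.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])  # hypothetical sample

# Quartiles via NumPy's default (linear interpolation) percentile method
q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range: spread of the middle 50% of the data
print(q1, q2, q3, iqr)
```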
A curve is a shape or a line which is
smoothly drawn in a plane having a
44 Curve bent or turns in it. For example, a
circle is an example of curved-
shape.
A z-score measures exactly how
45 z- score many standard deviations above or
below the mean a data point is.
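A z-score can be computed as (value - mean) / standard deviation. The sketch below uses a hypothetical sample and assumes the population standard deviation (`statistics.pstdev`); use `statistics.stdev` instead if the data are a sample from a larger population.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

mean = statistics.mean(data)
sd = statistics.pstdev(data)  # population standard deviation (an assumption)

# z = (x - mean) / standard deviation
z_scores = [(x - mean) / sd for x in data]
print(z_scores)
```

Here the value 9 lies two standard deviations above the mean, and the value 2 lies one and a half standard deviations below it.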
The standard deviation is the
average amount of variability in
Standard
46 your dataset. It tells you, on
deviation
average, how far each value lies
from the mean.
The degrees of freedom in a
statistical calculation represent how
Degrees of
47 many values involved in a
freedom
calculation have the freedom to
vary.
A Discrete Variable has a certain
Discrete
48 number of particular values and
variable
nothing else.
49 Continuous A continuous variable is defined as
a variable which can take an
variable uncountable set of values or infinite
set of values.
50 Proportion
A proportion expresses a relationship between two quantities, such as a part compared with the whole.
UNIT III- DESCRIBING RELATIONSHIP
51 Computation of Standard Error of Estimate
Notation: SEE = SD / SQRT(number of measurements)
The standard error is calculated by dividing the standard deviation by the square root of the sample size. It gives the precision of a sample mean by accounting for the sample-to-sample variability of the sample means.
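The formula quoted in this entry, SD divided by the square root of the number of measurements, can be sketched as follows. The measurements are hypothetical, and the sample standard deviation (n - 1 in the denominator) is assumed.

```python
import math
import statistics

measurements = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample

sd = statistics.stdev(measurements)   # sample standard deviation (n - 1)
n = len(measurements)
standard_error = sd / math.sqrt(n)    # SE = SD / sqrt(n)
print(round(standard_error, 4))
```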
R2 is a statistical measure that
determines the proportion of the
Interpretation of
52 variation in the dependent variable
r2
that can be described by the
independent variable.
Multiple It is a method to predict the
53 regression dependent variable with the help of
equations two or more independent variables.
Regression It refers to the tendency for scores,
54 towards the particularly extreme scores to
mean shrink toward the mean.
55 Correlation matrix
It is a table which displays the correlation coefficients for different variables. The matrix depicts the correlation between all the possible pairs of variables in a table.
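A correlation matrix can be produced with the pandas `DataFrame.corr()` method, which computes Pearson correlation by default. The columns `x`, `y`, and `z` below are hypothetical data constructed so that the expected pattern is easy to see.

```python
import pandas as pd

# Hypothetical data: y rises with x, z falls as x rises
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 4, 3, 2, 1],
})

corr = df.corr()  # Pearson correlation for every pair of columns
print(corr.round(2))
```

The diagonal is always 1, since every variable is perfectly correlated with itself.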
It is a term used to describe a
Linear
56 straight line relationship between
relationship
two variables.
57 Multiple regression equation
Notation: Y = m1X1 + m2X2 + m3X3 + b
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique that uses several explanatory variables to predict the outcome of a response variable.
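One way to estimate the coefficients of a multiple regression equation is ordinary least squares. The sketch below uses `numpy.linalg.lstsq` on hypothetical predictors; the response is generated from known coefficients (2, 3, -1 and intercept 5) so that the fit can be checked by eye.

```python
import numpy as np

# Hypothetical predictors; y is generated exactly from known coefficients
x1 = np.array([1, 2, 3, 4, 5, 6], dtype=float)
x2 = np.array([2, 1, 4, 3, 6, 5], dtype=float)
x3 = np.array([1, 0, 1, 0, 2, 1], dtype=float)
y = 2 * x1 + 3 * x2 - 1 * x3 + 5

# Design matrix: one column per predictor plus a column of ones for the intercept
A = np.column_stack([x1, x2, x3, np.ones_like(x1)])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
m1, m2, m3, b = coeffs
print(np.round(coeffs, 3))
```

With real, noisy data the recovered coefficients would only approximate the underlying relationship.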
58 Homoscedasticity
Homoscedasticity, or homogeneity of variance, is an assumption of equal or similar variances in the different groups being compared.
59 Partial least squares
Partial least squares regression is used to predict trends in data, much as multiple regression analysis is.
60 Simple linear regression
It is used to estimate the relationship between two quantitative variables.
61 Clusters
The data points in a scatter plot may form distinct groups. These groups are called clusters.
62 Correlation coefficients based on types of relationships
Pearson correlation, Kendall rank correlation, Spearman rank correlation, point-biserial correlation, Cramér's V.
It measures the relationship
63 Correlation
between two variables.
Positive correlation
Types of
64 Negative correlation
correlation
No correlation
Prediction
Need for Validity
65
correlation Reliability
Theory Verification
It is a graph combining a cluster of
66 Scatterplots dots that represents all pairs of
scores.
It indicates one event is the result
of occurrence of the other event
67 Causation
which is referred as cause and
effect.
68 Nonlinear relationship
A relationship between variables whose scatterplot does not resemble a straight line. It may resemble a curve or an inverted U.
69 Types of nonlinear relationship
Quadratic relationship
Cubic relationship
Exponential relationship
Logarithmic relationship
Cosine relationship
A Data point is called Outlier if it
70 Outlier
does not fit the pattern.
71 Regression
It is the relationship between a dependent variable and a series of other variables known as independent variables.
Types of
Linear Model
72 regression
Non Linear Model
models
Restricted It refers to the range of values that
73
Range has been condensed or shortened.
74 Regression Line
It shows the connection between the data points in a scatterplot and represents the best-fitting trend for the given dataset.
75 Regression Fallacy
It occurs whenever regression towards the mean is interpreted as a real effect rather than as a chance effect.
UNIT IV- PYTHON LIBRARIES FOR DATA WRANGLING
76 Numpy array NumPy is used to work with arrays.
The array object in NumPy is called
ndarray. We can create a NumPy
ndarray object by using the array()
function.
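A short sketch of creating ndarray objects with `np.array()`; the values are arbitrary and only serve to show the object type, shape, and element type.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])          # 1-D ndarray from a Python list
b = np.array([[1, 2, 3], [4, 5, 6]])   # 2-D ndarray from nested lists

print(type(a))           # the array object is an ndarray
print(a.shape, b.shape)  # (5,) (2, 3)
print(b.ndim, b.dtype)   # number of dimensions and element type
```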
A Python library is a collection of
related modules. It contains
77 Library
bundles of code that can be used
repeatedly in different programs.
Data wrangling ensures data is
reliable and complete before
professionals analyse it and use it to
78 Data wrangling
create insights. Thanks to this
process, those insights are based on
accurate, high-quality data.
Dynamic data or transactional
data is information that is
periodically updated, meaning it
79 Dynamic data
changes asynchronously over time
as new information becomes
available.
80 List
Python lists are like the dynamically sized arrays found in other languages (vector in C++ and ArrayList in Java). In simple language, a list is a collection of things, enclosed in [ ] and separated by commas.
Database replication is the frequent
electronic copying of data from a
database in one computer
81 Replication
or server to a database in another --
so that all users share the same
level of information.
A data join is when two data sets
are combined in a side by side
82 Joining manner, therefore at least one
column in each data set must be
the same.
An aggregation is a collection, or
the gathering of things together.
83 Aggregation Your baseball card collection might
represent the aggregation of lots of
different types of cards.
Joining together two or more things
into a large one. In database
parlance, the things being joined
84 Concatenation
are generally two table fields which
may be from the same or different
tables.
85 Scalar The physical quantities which are
specified with the magnitude or size
alone are scalar quantities. For
example, length, speed, work,
mass, density, etc.
Comparison operators in Python,
also called relational operators,
are used to compare two operands.
86 Comparison They return a Boolean True or False
depending on whether the
comparison condition is true or
false.
Boolean logic takes two statements
or expressions and applies a logical
87 Boolean Logic operator to generate a Boolean
value that can be either true or
false.
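Comparison operators and Boolean logic combine naturally on NumPy arrays, where they work element by element. In the sketch below the `marks` array is hypothetical; `&` and `~` are NumPy's element-wise AND and NOT.

```python
import numpy as np

marks = np.array([35, 72, 48, 90, 61])  # hypothetical scores

passed = marks >= 50          # comparison -> Boolean array
top = marks > 85
mid_band = passed & ~top      # Boolean logic: passed AND NOT top

print(passed)
print(marks[mid_band])        # Boolean mask used to select elements
```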
88 Indexing
Indexing is a method of accessing individual elements of a list, array, or DataFrame by their position or label.
Structured arrays are ndarrays
whose data type is a composition of
89 Structured array
simpler data types organized as a
sequence of named fields.
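A minimal sketch of a structured array, assuming a hypothetical record layout with three named fields (`name`, `age`, `weight`).

```python
import numpy as np

# Hypothetical record layout: a name, an age and a weight per row
people = np.zeros(3, dtype=[("name", "U10"), ("age", "i4"), ("weight", "f8")])

people["name"] = ["Alice", "Bob", "Cathy"]
people["age"] = [25, 45, 37]
people["weight"] = [55.0, 85.5, 68.0]

print(people["name"])                       # access one field by name
print(people[people["age"] < 40]["name"])   # filter rows on a field
```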
Data manipulation is the process of
Data arranging a set of data to make it
90
manipulation more organized and easier to
interpret.
Pandas is a fast, powerful, flexible
and easy to use open source data
91 Pandas analysis and manipulation tool,
built on top of
the Python programming language.
A Pandas Data Frame is a 2
dimensional data structure, like a 2
92 Data Frame
dimensional array, or a table with
rows and columns.
To conform DataFrame to a new
Reindexing in Index with optional filling logic,
93
Pandas placing NA/NaN in location having
no value in the previous index.
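Reindexing can be sketched with `DataFrame.reindex()`. The frame, its `score` column, and the index labels are hypothetical; the label "d" is absent from the original index, so it receives NaN unless a fill value is supplied.

```python
import pandas as pd

df = pd.DataFrame({"score": [10, 20, 30]}, index=["a", "b", "c"])

# Labels missing from the old index ("d") get NaN by default
reindexed = df.reindex(["a", "c", "d"])

# Optional filling logic: supply a value for the new labels
filled = df.reindex(["a", "c", "d"], fill_value=0)
print(reindexed)
print(filled)
```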
Pandas Series
94 Pandas Objects Pandas DataFrame
Index
Merge and Join Datasets
Features of Indexing and Subsetting data
95
Pandas Arrays into Multidimensional
data
96 Operations on Null Values
isnull()
notnull()
dropna()
fillna()
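The sketch below exercises the pandas null-handling methods on a small hypothetical Series: `isnull()` and `notnull()` detect missing values, `dropna()` (pandas' method for discarding missing entries) removes them, and `fillna()` replaces them.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, None])  # hypothetical data with gaps

print(s.isnull())    # True where a value is missing
print(s.notnull())   # True where a value is present
print(s.dropna())    # missing entries removed
print(s.fillna(0))   # missing entries replaced by 0
```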
97 Combining Datasets Methods
concat()
append() (removed in pandas 2.0; use concat() instead)
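Combining datasets row-wise can be sketched with `pd.concat()`. The two small frames and their columns are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "name": ["A", "B"]})
df2 = pd.DataFrame({"id": [3, 4], "name": ["C", "D"]})

# Stack rows; ignore_index=True rebuilds a clean 0..n-1 index
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```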
Relational algebra refers to a
procedural query language that
Relational
98 takes relation instances as input
Algebra
and returns relation instances as
output.
Grouping of data plays a significant
role when we have to deal with
99 Grouping large data. This information can
also be displayed using
a pictograph or a bar graph.
A PivotTable is a powerful tool to
calculate, summarize, and analyze
data that lets you see comparisons,
100 Pivot table patterns, and trends in your
data. PivotTables work a little bit
differently depending on what
platform you are using to run Excel.
UNIT V- DATA VISUALIZATION
It is a multiplatform data
101 Matplotlib visualization library built on numpy
arrays.
Interfaces of MATLAB style state based interface,
102
Matplotlib Object oriented interface.
103 Line plots
A line plot is used to represent the relation between two variables X and Y, plotted on different axes.
104 Scatter plots
The scatter() method in the matplotlib library is used to draw a scatter plot. Scatter plots are used to visualize the relation among variables and how a change in one affects the other.
105 Continuous errors
Continuous error bands are a graphical representation of error or uncertainty as a shaded region around a main trace, rather than as discrete whisker-like error bars.
These are the methods to show a
106 Contour plots three dimensional surface on a two
dimensional plane.
107 Histograms
A histogram is a graph showing a frequency distribution. It shows the number of observations within each given interval.
These are groups of smaller axes
108 subplots that can exist together within a
single figure.
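A grid of subplots can be sketched with `plt.subplots()`. The 2 x 2 layout and the plotted curves are arbitrary choices for illustration, and the non-interactive "Agg" backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# A 2 x 2 grid of axes inside one figure
fig, axes = plt.subplots(2, 2)
axes[0, 0].plot(x, np.sin(x))
axes[0, 1].plot(x, np.cos(x))
axes[1, 0].hist(np.sin(x), bins=10)
axes[1, 1].scatter(x, np.cos(x), s=5)
```

In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` to display the figure.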
3D plots are enabled by importing
109 3D plotting the mplot3d toolkit, included with
the main matplotlib installation.
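110 Syntax for wire frame
import matplotlib.pyplot as plt
# X, Y, Z are assumed to be 2-D coordinate arrays (for example from np.meshgrid)
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_wireframe(X, Y, Z, color='red')
ax.set_title('wireframe')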
It is a visualization tool to measure
data distributions. It can be
112 Density plot
considered as a smoothed
histogram.
113 Code to draw sine & cosine waves
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig = plt.figure()
plt.plot(x, np.sin(x), '-')   # sine wave
plt.plot(x, np.cos(x), '-')   # cosine wave
114 KDE
Kernel Density Estimation is one of the techniques used to smooth a histogram.
Pseudo It gives better properties near the
115 cylindrical poles of the projection
projection
116 Conic projection
It projects the map onto a single cone, which is then unrolled. This can lead to very good local properties, but regions far from the focus point of the cone may become very distorted.
117 Cylindrical Projections
In cylindrical projections, lines of constant latitude and longitude are mapped to horizontal and vertical lines, respectively.
118 Map projections
A map projection is used to project a spherical map, such as that of the Earth, onto a flat surface. No projection can do this without somehow distorting the map or breaking its continuity.
119 Customize matplotlib
Setting rcParams at runtime
Using style sheets
Changing your matplotlibrc file
120 Classes of color maps in scatter plot
Sequential
Diverging
Cyclic
Qualitative
121 Syntax for scatter()
matplotlib.pyplot.scatter(x_axis_data, y_axis_data, s=None, c=None, marker=None, cmap=None, vmin=None, vmax=None, alpha=None, linewidths=None, edgecolors=None)
122 Error bars
They indicate the estimated error or uncertainty, showing how precise a measurement or analytical model is.
123 Bar plots
Used to aggregate categorical data according to some method; by default, the mean.
124 Factor plots
Factor plots allow you to visualize the distribution of a parameter within bins defined by any other parameter.
125 Pair plots
Pair plots show pairwise relationships in a dataset. This is a high-level interface for PairGrid that is intended to make it easy to draw a few common styles.
Placement Questions
Data science is the area of study
which involves extracting insights
126 Data Science from vast amounts of data using
various scientific methods,
algorithms, and processes.
A dummy variable is a numerical
Dummy variable used in regression analysis
127
variables to represent subgroups of the
sample in the study.
Tools for model R and PL/R, Octave, WEKA, Python,
128
building SQL, MADLib
It is a document that lays out the
project vision, scope, objectives,
129 Project charter
project team and their
responsibilities.
The definition of big data is data
that contains greater variety,
130 Big data
arriving in increasing volumes and
with more velocity.
Data is a collection of discrete or
continuous values that convey
information, describing the
131 Data quantity, quality, fact, statistics,
other basic units of meaning, or
simply sequences of symbols that
may be further interpreted formally
Nominal data is data that can be
labelled or classified into mutually
132 Nominal data exclusive categories within a
variable. These categories cannot
be ordered in a meaningful way.
Ordinal data classifies data while
133 Ordinal data
introducing an order, or ranking.
134 Histogram
A histogram is a graphical representation of the distribution of data. It is drawn as a set of adjacent rectangles, where each bar represents the frequency of values falling within one interval (bin).
A continuous variable is defined as
Continuous a variable which can take an
135
variable uncountable set of values or infinite
set of values.
136 Regression
It is the relationship between a dependent variable and a series of other variables known as independent variables.
It measures the relationship
137 Correlation
between two variables.
138 Regression
It is the relationship between a dependent variable and a series of other variables known as independent variables.
Restricted It refers to the range of values that
139
Range has been condensed or shortened.
140 Regression Line
It shows the connection between the data points in a scatterplot and represents the best-fitting trend for the given dataset.
NumPy is used to work with arrays.
The array object in NumPy is called
141 Numpy array ndarray. We can create a NumPy
ndarray object by using the array()
function.
A Python library is a collection of
related modules. It contains
142 Library
bundles of code that can be used
repeatedly in different programs.
Data wrangling ensures data is
reliable and complete before
143 Data wrangling
professionals analyse it and use it to
create insights.
Pandas is a fast, powerful, flexible
and easy to use open source data
144 Pandas analysis and manipulation tool built
on top of the Python programming
language.
A Pandas Data Frame is a 2
dimensional data structure, like a 2
145 Data Frame
dimensional array, or a table with
rows and columns.
It is a multiplatform data
146 Matplotlib visualization library built on numpy
arrays.
It is a visualization tool to measure
data distributions. It can be
147 Density plot
considered as a smoothed
histogram.
These are the methods to show a
148 Contour plots three dimensional surface on a two
dimensional plane.
149 Seaborn
It provides an API on top of matplotlib with plot styles, color defaults, and statistical plot types that work with Pandas DataFrames.
150 Difference between Matplotlib and Seaborn in Visualization
Matplotlib works with NumPy and Pandas as a graphics package for visualization.
Seaborn is more comfortable in handling Pandas DataFrames.