Foundations of Data Science: Detailed Notes & Exam Guide
Syllabus source: Based on Anna University CS3352 syllabus. (Stucor)
Unit I — Introduction
Topics & Key Concepts
1. Data Science: Benefits and Uses
o Definition: a multidisciplinary field that uses scientific methods, processes,
algorithms and systems to extract knowledge or insights from data
(structured/unstructured).
o Benefits: improved decision-making, predictive analytics, automation,
personalization, cost reduction.
o Use cases: healthcare, finance, marketing, fraud detection, recommendation
systems.
2. Facets of Data
o Volume, Velocity, Variety, Veracity, Value (the 5 Vs).
o Structured vs unstructured vs semi-structured data.
3. Data Science Process / Lifecycle
o Steps:
1. Defining research goals / problem statement
2. Retrieving / collection of data
3. Data preparation / cleaning / transformation
4. Exploratory Data Analysis (EDA)
5. Model building / algorithm selection
6. Evaluation / validation
7. Presentation of findings / deployment / monitoring
o Iteration is common: results at a later step (e.g., poor evaluation metrics) often send the process back to an earlier one.
4. Defining Research Goals
o Converting business problem to data problem.
o Determining target variables, hypotheses, metrics (accuracy, RMSE, etc.).
5. Retrieving Data
o Sources: databases, APIs, web scraping, sensors, logs.
o Techniques: SQL queries, REST APIs, crawling, file imports.
6. Data Preparation
o Cleaning: handle missing values, duplicates, erroneous entries.
o Transformations: scaling, encoding categorical variables, feature engineering.
o Integration: merging multiple data sources.
7. Exploratory Data Analysis (EDA)
o Summary statistics, distributions, visual inspection, correlation matrices.
o Purpose: understand structure, detect outliers, test assumptions, guide modeling.
8. Building the Model / Diagnostics / Application
o Choose algorithms, hyperparameter tuning, training/testing split, cross-validation.
o Diagnostic checks (residuals, overfitting, underfitting).
o Deploying model, interpretability, retraining cycle.
9. Data Mining & Data Warehousing
o Data Mining: extracting patterns (classification, clustering, association).
o Data Warehousing: storing large integrated data, ETL process, OLAP, schema
designs (star, snowflake).
10. Basic Statistical Descriptions of Data
o Measures of central tendency (mean, median, mode)
o Measures of dispersion (variance, standard deviation, range, IQR)
o Skewness, kurtosis
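As a rough illustration of the basic statistical descriptions listed above, the sketch below summarizes a small sample with pandas. The data values are hypothetical and chosen to be symmetric; note that `Series.var()` and `Series.std()` use the sample (n - 1) denominator by default, and `Series.kurt()` reports excess kurtosis.

```python
import pandas as pd

# Hypothetical sample, symmetric around 3 (not taken from the syllabus)
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5])

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().tolist(),      # a sample may have several modes
    "range": values.max() - values.min(),
    "variance": values.var(),            # sample variance (ddof=1)
    "std": values.std(),
    "iqr": values.quantile(0.75) - values.quantile(0.25),
    "skewness": values.skew(),           # close to 0 for symmetric data
    "kurtosis": values.kurt(),           # excess kurtosis (normal = 0)
}
for name, value in summary.items():
    print(f"{name:>9}: {value}")
```

A skewness near zero is expected here because the sample is symmetric; a long right tail would give a positive value.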
How Exam Questions Are Posed & Sample Answers
Short (2-marks) Type:
Define data science and mention two benefits.
What are the 5 Vs of data?
Name the steps in the data science process.
Long (16-marks) Type:
“Explain the full data science lifecycle in detail, with a real-world example and
diagram.”
“Discuss the differences between data mining and data warehousing, and how they
integrate into a data science workflow.”
Model Answer Outline for Long Question:
Introduce the concept of a lifecycle, list steps.
Use a real example (e.g. predicting customer churn): show how each step applies.
Diagram labeling steps with arrows.
In the mining vs warehousing question: define each, give functions, pros/cons, and how a
data warehouse feeds mining tasks.
Unit II — Describing Data
Topics & Key Concepts
1. Types of Data
o Quantitative (numerical) vs Qualitative (categorical)
o Within quantitative: discrete vs continuous
o Within qualitative: nominal vs ordinal
2. Types of Variables
o Independent / dependent (in modeling)
o Explanatory / response
o Binary, categorical, interval, ratio variables.
3. Describing Data with Tables & Graphs
o Frequency tables, contingency tables (cross-tabulation)
o Graphs: histograms, bar charts, pie charts, box plots, stem-leaf plots.
4. Describing Data with Averages
o Arithmetic mean, geometric mean, harmonic mean (if relevant)
o Weighted mean, trimmed mean
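o When to use which average: arithmetic mean for additive quantities, geometric mean for growth rates and ratios, harmonic mean for rates (e.g., average speed), median or trimmed mean when outliers are present.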
5. Describing Variability
o Range, variance, standard deviation, inter-quartile range (IQR)
o Coefficient of variation (CV) = SD / mean.
6. Normal Distribution & z-Scores
o Normal (Gaussian) curve, properties (symmetric, mean = median = mode)
o Standard normal distribution (mean 0, variance 1)
o z-score formula: ( z = \frac{x - \mu}{\sigma} )
o Using z to compare across distributions, compute probabilities.
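To illustrate how z-scores allow comparison across distributions, the sketch below compares one student's marks in two subjects whose class means and spreads differ. All numbers are hypothetical and chosen only for easy arithmetic.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from the mean mu."""
    return (x - mu) / sigma

# Hypothetical figures: (student's mark, class mean, class SD)
maths = (80, 70, 5)
physics = (85, 75, 10)

z_maths = z_score(*maths)       # (80 - 70) / 5
z_physics = z_score(*physics)   # (85 - 75) / 10

print(f"Maths z = {z_maths}, Physics z = {z_physics}")

# The raw physics mark is higher, but relative to each class
# the subject with the larger z-score is the stronger performance.
better = "Maths" if z_maths > z_physics else "Physics"
print("Relatively better subject:", better)
```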
How Exam Questions Are Posed & Sample Answers
Short Qs:
What is the difference between nominal and ordinal data?
What is IQR?
Formula for z-score.
What graphs would you use for categorical vs numeric data?
Long Qs:
“Given a dataset (or class frequencies), compute mean, median, variance, standard
deviation and interpret results.”
“Explain normal distribution, properties, and how z-scores help in standardizing data.”
“Demonstrate how to represent data using tables and different graphical methods with
examples.”
Model Answer Outline:
Present formulas with definitions.
Provide a small example dataset and walk through calculations.
Draw diagrams (normal curve, box plot).
Interpret meaning (spread, outliers).
Unit III — Describing Relationships
Topics & Key Concepts
1. Correlation & Scatter Plots
o Scatter plot to visualize relationship
o Positive, negative, zero correlation
o Causation vs correlation
2. Correlation Coefficient (Pearson r)
o Formula:
[
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
]
o Properties: range [-1, +1], sign, strength.
o Interpret correlation magnitude.
3. Regression
o Simple linear regression: ( y = a + b x )
o Derivation of ( b ) and ( a ):
[
b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \quad a = \bar{y} - b \bar{x}
]
o Prediction, residuals.
4. Standard Error of Estimate
o Measures dispersion of observed vs predicted values.
o Formula:
[
S_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}
]
5. Interpretation of ( R^2 )
o Proportion of variance in dependent variable explained by model: ( R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} ).
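o Limitations: a high ( R^2 ) can reflect overfitting (it never decreases when predictors are added) or a spurious correlation, so it does not by itself validate the model.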
6. Multiple Regression
o Regression model with multiple predictors:
[
y = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k
]
o Estimation via matrix methods (normal equations) or software.
7. Regression toward the Mean
o Concept: extreme values tend to regress to average on repeated measures.
o Commonly observed in psychological testing, repeated SAT scores, and sports performance.
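Regression toward the mean can be seen in a small simulation. The sketch below assumes a toy model (an assumption for illustration, not from the syllabus) in which each observed score is a stable ability plus independent luck, then follows the top 10% of first-test scorers into a second test.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed for repeatability
n = 10_000

# Assumed model: observed score = stable ability + independent luck
ability = rng.normal(loc=50, scale=10, size=n)
test1 = ability + rng.normal(scale=10, size=n)
test2 = ability + rng.normal(scale=10, size=n)

# Select the top 10% of performers on the first test
cutoff = np.quantile(test1, 0.90)
top = test1 >= cutoff

mean_first = test1[top].mean()
mean_second = test2[top].mean()
overall = test1.mean()

print(f"Top group, test 1: {mean_first:.1f}")
print(f"Top group, test 2: {mean_second:.1f}")
print(f"Overall mean:      {overall:.1f}")
```

Under this model the selected group still scores above average on the second test, but less extremely than on the first, because part of the first-test advantage was luck.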
How Exam Questions Are Posed & Sample Answers
Short Qs:
Define correlation.
State the formulas for (b) and (a).
What is ( R^2 )?
What is standard error of estimate?
Long Qs:
“Given dataset x, y, compute Pearson correlation, regression line, predict new y,
compute error (residual) and give interpretation.”
“Explain multiple regression, its assumptions, advantages, and example usage.”
“Discuss regression toward the mean with an example.”
Model Answer Outline:
Start with scatter plot and correlation concept.
Derive regression coefficients with formulas, show computations.
Show prediction and residual analysis.
Compute ( R^2 ) and discuss meaning.
For multiple regression, present matrix form, assumptions (linearity, independence,
homoscedasticity, no multicollinearity) and example use-case.
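The residual analysis, ( R^2 ) and standard error of estimate mentioned in this outline can be computed directly from the Unit III formulas. The following is a minimal sketch on hypothetical data; the variable names are illustrative.

```python
import numpy as np

# Hypothetical data (not from the syllabus)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 5], dtype=float)

# Least-squares slope and intercept
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * x                 # predicted values
residuals = y - y_hat             # observed minus predicted

ss_res = np.sum(residuals ** 2)             # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)        # total variation
r_squared = 1 - ss_res / ss_tot
s_e = np.sqrt(ss_res / (len(x) - 2))        # standard error of estimate

print(f"y = {a:.2f} + {b:.2f}x")
print(f"R^2 = {r_squared:.2f}, S_e = {s_e:.3f}")
```

In simple linear regression ( R^2 ) equals the square of Pearson's r, which is a quick consistency check in an exam answer.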
Unit IV — Python Libraries for Data Wrangling
Topics & Key Concepts
1. Basics of NumPy Arrays
o Creation: e.g., np.array, np.zeros, np.ones, np.arange, np.linspace.
o Attributes: shape, dtype, ndim.
2. Aggregations & Computations on Arrays
o Functions: sum, mean, min, max, std, var.
o Axis parameter.
3. Comparisons, Masks, Boolean Logic
o Element-wise comparisons (>, <, ==).
o Masking: arr[arr>threshold].
4. Fancy Indexing
o Selecting subsets with arrays of indices.
o Boolean indexing, integer indexing.
5. Structured Arrays
o Arrays with named fields (heterogeneous types).
o Example dtype specification.
6. Pandas: Indexing & Selection
o df.loc[], df.iloc[], slicing.
o df.at[], df.iat[].
7. Operating on Data
o Arithmetic operations, broadcasting, applying functions (apply, map, applymap; DataFrame.applymap is deprecated in favor of DataFrame.map since pandas 2.1).
8. Missing Data Handling
o df.isnull(), df.dropna(), df.fillna().
o Interpolation, forward/backward fill.
9. Hierarchical Indexing (MultiIndex)
o Creating multi-level indices, selecting cross-sections.
10. Combining Datasets
o Concatenation (pd.concat), merging (pd.merge), joins.
11. Aggregation & Grouping
o groupby, aggregation functions, pivot tables.
12. Pivot Tables
o df.pivot_table() with index, columns, values, aggfunc.
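The NumPy topics above (attributes, axis-wise aggregation, masks, fancy indexing, structured arrays) can be sketched in one short script. The array contents and the student records are hypothetical.

```python
import numpy as np

# Creation and attributes
grid = np.arange(12).reshape(3, 4)      # values 0..11 in 3 rows, 4 columns
print(grid.shape, grid.ndim, grid.dtype)

# Aggregation along an axis
col_sums = grid.sum(axis=0)             # collapse rows: one sum per column
row_means = grid.mean(axis=1)           # collapse columns: one mean per row

# A comparison produces a boolean mask; the mask selects elements
mask = grid > 6
above_six = grid[mask]

# Fancy indexing: pick rows 0 and 2 by passing a list of indices
picked_rows = grid[[0, 2]]

# Structured array with named, differently typed fields
students = np.array(
    [("Asha", 21, 8.4), ("Ravi", 22, 7.9)],
    dtype=[("name", "U10"), ("age", "i4"), ("cgpa", "f8")],
)
print(students["name"], students["cgpa"].mean())
```

The `axis` argument names the dimension that is collapsed, which is a common source of confusion: `axis=0` aggregates down the rows and returns one value per column.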
How Exam Questions Are Posed & Sample Answers
Short Qs:
What is the difference between loc and iloc?
How do you fill missing values?
Define fancy indexing.
What is hierarchical indexing?
Long Qs:
“Write a Python script using NumPy and Pandas to load a dataset, handle missing
values, group by a categorical column and compute summary statistics, then pivot the
result.”
“Explain fancy indexing and structured arrays with examples.”
“Discuss pitfalls in data wrangling and how Pandas helps.”
Model Answer Outline:
Provide code snippets with explanations.
Show a dataset (small example) and run grouping, merging, pivot.
Illustrate boolean indexing, masks.
Explain time complexity and pitfalls (chained indexing, SettingWithCopyWarning).
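The loc/iloc distinction and the chained-indexing pitfall named in this outline can be shown on a small frame. This is a sketch with hypothetical data; the integer index labels are deliberately different from the row positions so the two access styles give visibly different results.

```python
import pandas as pd

# Hypothetical data with non-default integer labels
df = pd.DataFrame(
    {"name": ["A", "B", "C", "D"], "marks": [90, 75, 88, 60]},
    index=[10, 20, 30, 40],
)

by_label = df.loc[10:30]       # label-based; the end label 30 is included
by_position = df.iloc[0:2]     # position-based; the end position 2 is excluded

# Boolean mask combined with column selection in a single .loc call
high_scorers = df.loc[df["marks"] > 80, "name"]

# Assign through one .loc call rather than chained indexing
# (df[df["marks"] < 70]["marks"] = 70 may act on a temporary copy)
df.loc[df["marks"] < 70, "marks"] = 70

print(by_label, by_position, high_scorers.tolist(), df, sep="\n\n")
```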
Unit V — Data Visualization
Topics & Key Concepts
1. Importing & Basic Matplotlib
o import matplotlib.pyplot as plt
o Basic plot: plt.plot(x, y), plt.show().
2. Line & Scatter Plots
o Plotting trends vs relationships.
o Adding markers, line styles, labels.
3. Visualizing Errors
o Plot residuals, error bars (plt.errorbar).
4. Density & Contour Plots
o KDE plots (using Seaborn)
o Contour plots from grid data (plt.contour).
5. Histograms
o plt.hist() or sns.histplot(), bins, density normalization.
6. Legends, Colors, and Customization
o plt.legend(), color, linewidth, style.
7. Subplots
o plt.subplot() / plt.subplots() for multiple plots.
8. Text & Annotation
o plt.text(), plt.annotate() for labeling points.
9. Customization
o Limits (plt.xlim, plt.ylim), grid, ticks, style.
10. 3D Plotting
o Axes3D for surface, wireframe.
11. Geographic Data with Basemap
o Using Basemap (or newer Cartopy) to plot maps, overlay data points.
12. Visualization with Seaborn
o Higher-level plots such as sns.pairplot, sns.heatmap, sns.boxplot, sns.violinplot.
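Several of the Matplotlib topics above (subplots, error bars, legends, annotation, axis limits, density histograms) can be combined in one figure. The sketch below uses synthetic data; the non-interactive "Agg" backend and the output file name are assumptions made so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")            # assumption: headless run; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 20)
y = np.sin(x)
y_err = np.full_like(x, 0.2)     # hypothetical constant measurement error

fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: line with error bars, a legend and an annotated point
ax_line.errorbar(x, y, yerr=y_err, fmt="o-", color="tab:blue", label="sin(x) ± 0.2")
peak = np.argmax(y)
ax_line.annotate("maximum", xy=(x[peak], y[peak]),
                 xytext=(x[peak] + 1, y[peak] + 0.3),
                 arrowprops={"arrowstyle": "->"})
ax_line.set(title="Line with error bars", xlabel="x", ylabel="y", ylim=(-1.5, 1.8))
ax_line.legend()
ax_line.grid(True)

# Right panel: histogram normalized to a density
samples = np.random.default_rng(0).normal(size=500)
ax_hist.hist(samples, bins=20, density=True, edgecolor="black")
ax_hist.set(title="Histogram (density)", xlabel="value", ylabel="density")

fig.tight_layout()
fig.savefig("unit5_sketch.png")  # hypothetical file name; use plt.show() interactively
```

Working with the `Axes` objects returned by `plt.subplots()` keeps each panel's settings separate, which tends to be clearer than repeated `plt.` calls once a figure has more than one plot.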
How Exam Questions Are Posed & Sample Answers
Short Qs:
How do you add a legend in Matplotlib?
What is a contour plot?
Name one Seaborn function for pairwise relationships.
What's the difference between scatter and line plot?
Long Qs:
“Given a dataset, produce multiple visualizations (histogram, scatter, contour),
customize with labels, annotations, subplots, and discuss insights.”
“Explain how to visualize geospatial data using Basemap/Cartopy with an example.”
“Discuss pros and cons of Matplotlib vs Seaborn; when to use each.”
Model Answer Outline:
Present Python code with matplotlib and seaborn.
Show sample outputs (embedded plots).
Explain axes, labels, aesthetics, insights.
Discuss mapping plots (latitude/longitude, projection, map overlays).
Mention performance, style, clarity tradeoffs.
Final “Exam Tips & Strategy” Notes
Always state definitions/formulas clearly in theory questions.
In calculations show each step (deviations, sums, substitution).
Use small illustrative datasets where needed (5–6 points) rather than large ones.
In coding questions, write clean and commented code (imports, function definitions).
For visualization, always include title, axis labels, legends.
In regression/relationships, interpret slope, intercept, ( R^2 ) in context.
Do error checking: missing values, outliers, assumptions of model.
Time management: do easy ones first (Part A), then allocate 15–20 minutes per long
answer.
Foundations of Data Science – Long Answer Notes
Unit 1 – Introduction
Q1. Explain the Data Science Lifecycle with a neat diagram and real-world
example.
Answer:
Steps in the Data Science Lifecycle:
1. Define the Problem / Research Goals – translate business objective into a data
science problem.
2. Data Collection – collect raw data from sensors, databases, APIs.
3. Data Preparation – cleaning, missing value imputation, transformation,
integration.
4. Exploratory Data Analysis (EDA) – visualize data, detect patterns, test
hypotheses.
5. Model Building – select suitable algorithms (regression, classification,
clustering).
6. Model Evaluation – use metrics (accuracy, RMSE, R², precision, recall).
7. Deployment & Monitoring – put the model into production and continuously
monitor.
Diagram:
Problem → Data Collection → Preparation → EDA → Modeling → Evaluation → Deployment
(circular arrows showing iteration)
Example: Predicting credit card fraud – define fraud detection, collect transaction logs, clean
anomalies, analyze distributions, build classification model, test accuracy, deploy model in bank
system.
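For the evaluation step of the fraud example, accuracy alone can mislead because fraud cases are rare, so precision and recall are usually reported as well. The sketch below computes all three from hypothetical labels; the label lists are invented for illustration.

```python
# Hypothetical labels: 1 = fraud, 0 = legitimate
actual    = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)   # fraud caught
tn = sum(1 for a, p in pairs if a == 0 and p == 0)   # legitimate passed
fp = sum(1 for a, p in pairs if a == 0 and p == 1)   # false alarm
fn = sum(1 for a, p in pairs if a == 1 and p == 0)   # fraud missed

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)    # of flagged transactions, how many were fraud
recall = tp / (tp + fn)       # of actual fraud, how many were flagged

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```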
Q2. Differentiate Data Mining and Data Warehousing.
Answer:
Aspect | Data Mining | Data Warehousing
Definition | Process of extracting hidden patterns & knowledge from large datasets | Centralized repository for storing integrated historical data
Purpose | Prediction, classification, clustering, association | Efficient storage, query, and analysis
Techniques | Clustering, Regression, Classification | ETL (Extract, Transform, Load), OLAP
Example | Market basket analysis, fraud detection | Sales data warehouse in retail
Diagram:
Data Sources → ETL → Data Warehouse → Data Mining → Knowledge Discovery
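The flow in the diagram can be mimicked at toy scale with Python's built-in sqlite3 module. In this sketch an in-memory database stands in for the warehouse, and the table names, columns and rows are all hypothetical; it shows a star schema (one fact table, one dimension table) and an OLAP-style roll-up query.

```python
import sqlite3

con = sqlite3.connect(":memory:")   # in-memory database standing in for a warehouse
cur = con.cursor()

# Star schema: one fact table referencing a dimension table
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT)")
cur.execute("CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL)")

# "Load" step of ETL with hypothetical, already cleaned rows
cur.executemany("INSERT INTO dim_product VALUES (?, ?)",
                [(1, "Grocery"), (2, "Electronics")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 1, 120.0), (2, 1, 80.0), (3, 2, 900.0), (4, 2, 1100.0)])

# OLAP-style roll-up: total sales per category
rows = cur.execute(
    """
    SELECT d.category, SUM(f.amount) AS total
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_id = f.product_id
    GROUP BY d.category
    ORDER BY d.category
    """
).fetchall()
print(rows)
con.close()
```

A mining task such as clustering or association analysis would then read from aggregated tables like this one rather than from the raw source systems.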
Unit 2 – Describing Data
Q3. Explain measures of central tendency and dispersion with examples.
Answer:
Central Tendency:
o Mean (μ):
[
\mu = \frac{\sum x_i}{n}
]
Example: marks [10, 20, 30] → mean = 20.
o Median: middle value (25, 30, 40 → median = 30).
o Mode: most frequent value (5, 5, 6, 7 → mode = 5).
Dispersion:
o Range = Max – Min
o Variance (σ²):
[
\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}
]
o Standard Deviation (σ) = √Variance
o Interquartile Range (IQR) = Q3 – Q1
Diagram: Box plot (showing median, quartiles, outliers).
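The measures above can be checked with Python's standard `statistics` module. This is a sketch on a hypothetical data set chosen so the hand calculation is easy; `pvariance` and `pstdev` divide by n, matching the population formula given in this answer, and the "inclusive" quartile method is one of several conventions.

```python
import statistics as st

# Hypothetical marks chosen so the arithmetic is easy to follow by hand
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = st.mean(data)                 # 40 / 8
median = st.median(data)             # average of the two middle values
mode = st.mode(data)                 # most frequent value
data_range = max(data) - min(data)
variance = st.pvariance(data)        # population variance (divide by n)
std_dev = st.pstdev(data)            # square root of the variance

# Quartiles; the "inclusive" method interpolates within the data range
q1, q2, q3 = st.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(mean, median, mode, data_range, variance, std_dev, iqr)
```

Different textbooks and libraries compute quartiles slightly differently, so an exam answer should state which convention is used.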
Q4. What is a Normal Distribution? Explain properties and use of z-score.
Answer:
Normal Distribution: Bell-shaped, symmetric around mean.
Properties:
o Mean = Median = Mode.
o Area under curve = 1.
o 68–95–99.7 rule (1σ, 2σ, 3σ intervals).
z-score formula:
[
z = \frac{x - \mu}{\sigma}
]
Usage: Standardize values, compare across distributions, probability computations.
Diagram: Bell curve with μ at center, ±σ intervals marked.
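The 68–95–99.7 rule and a z-score probability can be verified numerically with `statistics.NormalDist` from the standard library. The mark of 85 and the N(70, 10) distribution below are hypothetical values used only for illustration.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)   # standard normal distribution

# Probability mass within k standard deviations of the mean
for k in (1, 2, 3):
    within = standard.cdf(k) - standard.cdf(-k)
    print(f"within ±{k}σ: {within:.4f}")

# Probability of a mark of 85 or less when marks follow
# a normal distribution with mean 70 and SD 10 (hypothetical values):
# convert to a z-score, then read the standard normal CDF
z = (85 - 70) / 10
p_below = standard.cdf(z)
print(f"z = {z}, P(X <= 85) = {p_below:.4f}")
```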
Unit 3 – Describing Relationships
Q5. Define correlation and regression. How is regression line derived?
Answer:
Correlation (r): Strength of linear relationship between x and y.
[
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
]
Regression Equation:
[
y = a + bx
]
Slope (b):
[
b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}
]
Intercept (a):
[
a = \bar{y} - b\bar{x}
]
Interpretation: Predict y given x, check residuals.
Diagram: Scatter plot with best-fit regression line.
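The formulas for r, b and a can be applied step by step with NumPy. The sketch below uses hypothetical data (hours studied against marks) and cross-checks the hand formulas against `np.corrcoef` and `np.polyfit`.

```python
import numpy as np

# Hypothetical data: hours studied (x) and marks obtained (y)
x = np.array([2, 4, 6, 8, 10], dtype=float)
y = np.array([50, 60, 65, 80, 95], dtype=float)

dx = x - x.mean()                 # deviations from the mean of x
dy = y - y.mean()                 # deviations from the mean of y

r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
b = np.sum(dx * dy) / np.sum(dx ** 2)     # slope
a = y.mean() - b * x.mean()               # intercept

# Cross-check against NumPy's built-in routines
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
assert np.allclose([b, a], np.polyfit(x, y, deg=1))

predicted = a + b * 7             # predicted marks for 7 hours of study
print(f"r = {r:.3f}, y = {a:.1f} + {b:.1f}x, prediction at x=7: {predicted:.1f}")
```

Note that r and b share the same numerator, so they always have the same sign.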
Q6. Explain Multiple Regression with an example.
Answer:
Equation:
[
y = a + b_1x_1 + b_2x_2 + \dots + b_kx_k
]
Example: Predicting house price using (area, bedrooms, age).
Interpretation: Each coefficient shows marginal effect while holding others constant.
Assumptions: linearity, independence, no multicollinearity, homoscedasticity.
Diagram: Plane fitting data points (for 2 predictors).
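A two-predictor version of the house-price example can be fitted with NumPy's least-squares solver. The data below are hypothetical and generated without noise from a known equation, so the recovered coefficients can be checked by eye; real data would not fit exactly.

```python
import numpy as np

# Hypothetical, noise-free data generated from price = 10 + 2*area + 3*bedrooms
area = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # arbitrary units
bedrooms = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
price = 10 + 2 * area + 3 * bedrooms

# Design matrix with a leading column of ones for the intercept a
X = np.column_stack([np.ones_like(area), area, bedrooms])

# Least-squares solution of X @ coef ≈ price
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
a, b1, b2 = coef
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")

# Predict for a new observation (area = 6, bedrooms = 3)
new_price = np.array([1.0, 6.0, 3.0]) @ coef
print(f"predicted price = {new_price:.2f}")
```

Each coefficient is read as the change in price per unit change in that predictor with the other held constant, which is the interpretation stated in the answer above.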
Unit 4 – Python Libraries for Data Wrangling
Q7. Write a Python program to handle missing values in a dataset.
Answer:
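import pandas as pd
# Sample data with one missing mark
data = {'Name': ['A', 'B', 'C', 'D'],
        'Marks': [90, None, 75, 88]}
df = pd.DataFrame(data)
print("Original Data:\n", df)
# Fill the missing mark with the column mean, rounded to one decimal.
# Assigning the result back avoids the chained inplace=True pattern,
# which newer pandas versions warn about.
df['Marks'] = df['Marks'].fillna(round(df['Marks'].mean(), 1))
print("\nAfter Handling Missing Values:\n", df)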
Output:
Original Data:
Name Marks
0 A 90.0
1 B NaN
2 C 75.0
3 D 88.0
After Handling Missing Values:
Name Marks
0 A 90.0
1 B 84.3
2 C 75.0
3 D 88.0
Explanation: fillna() replaces missing values with mean.
Q8. Explain Pandas GroupBy and Pivot Table with code.
Answer:
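import pandas as pd
# Sample data: marks by department
data = {'Dept': ['CSE', 'CSE', 'ECE', 'ECE', 'EEE'],
        'Marks': [80, 90, 70, 75, 85]}
df = pd.DataFrame(data)
# GroupBy: mean marks per department (returns a Series)
grouped = df.groupby('Dept')['Marks'].mean()
print(grouped)
# Pivot Table: same aggregation, returned as a DataFrame
pivot = df.pivot_table(values='Marks', index='Dept', aggfunc='mean')
print(pivot)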
Output:
Dept
CSE 85.0
ECE 72.5
EEE 85.0
Name: Marks, dtype: float64
Explanation: Both methods summarize average marks department-wise.
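The two methods start to differ once a second key is involved. In the sketch below (hypothetical data with an added Year column), `groupby` on two keys returns a Series with a hierarchical index, while `pivot_table` spreads the second key across the columns.

```python
import pandas as pd

# Hypothetical marks for two departments over two years
df = pd.DataFrame({
    "Dept":  ["CSE", "CSE", "CSE", "ECE", "ECE", "ECE"],
    "Year":  [2023, 2023, 2024, 2023, 2024, 2024],
    "Marks": [80, 90, 70, 60, 75, 85],
})

# groupby on two keys returns a Series with a hierarchical (Multi)Index
grouped = df.groupby(["Dept", "Year"])["Marks"].mean()

# pivot_table spreads the second key across the columns instead
pivot = df.pivot_table(values="Marks", index="Dept", columns="Year", aggfunc="mean")

print(grouped, pivot, sep="\n\n")
```

Calling `grouped.unstack()` would reshape the grouped Series into the same Dept-by-Year layout.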
Unit 5 – Data Visualization
Q9. Write a program to plot a histogram and boxplot for student marks.
Answer:
import matplotlib.pyplot as plt
import seaborn as sns
marks = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
# Histogram
plt.hist(marks, bins=5, color='skyblue', edgecolor='black')
plt.title("Histogram of Marks")
plt.xlabel("Marks")
plt.ylabel("Frequency")
plt.show()
# Boxplot (the x= keyword keeps the call valid across seaborn versions)
sns.boxplot(x=marks)
plt.title("Boxplot of Marks")
plt.show()
Diagram: Histogram with bins, Boxplot with median line and whiskers.
Q10. Explain Scatter Plot, Contour Plot, and 3D Surface Plot with code.
Answer:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # only needed on older Matplotlib versions
# Scatter
x = np.random.rand(50)
y = np.random.rand(50)
plt.scatter(x, y, color='red')
plt.title("Scatter Plot")
plt.show()
# Contour (filled levels)
X, Y = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
Z = X**2 + Y**2
plt.contourf(X, Y, Z, cmap='viridis')
plt.title("Contour Plot")
plt.show()
# 3D Surface
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis')
plt.title("3D Surface Plot")
plt.show()
Diagram: Scatter (red dots), Contour (filled levels), 3D surface (curved bowl).
Foundations of Data Science – Complete Notes (Anna University)
UNIT I – INTRODUCTION
Q1. Explain the Data Science process with a neat diagram and example.
Answer:
The Data Science process is the structured workflow of solving real-world problems using data.
Steps:
1. Problem Definition – Translate business problem into a data problem.
2. Data Collection – From databases, sensors, APIs, web scraping.
3. Data Preparation – Cleaning, handling missing values, transformations.
4. Exploratory Data Analysis (EDA) – Descriptive statistics, visualization.
5. Modeling – Statistical/ML models.
6. Evaluation – Accuracy, precision, recall, RMSE, R².
7. Deployment & Monitoring – Integrating into production systems.
Diagram:
↖-----------------------------------------------↙
Problem → Collect Data → Clean Data → EDA → Model → Evaluation → Deployment
Example: Predicting customer churn for a telecom company.
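For the churn example, the data preparation step might look like the following sketch. The customer and usage tables, their column names, and the choice of median imputation and min-max scaling are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical raw customer records with typical quality problems
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "plan": ["basic", "premium", "premium", None, "basic"],
    "monthly_bill": [20.0, 50.0, 50.0, None, 30.0],
})
usage = pd.DataFrame({"customer_id": [1, 2, 3, 4], "calls": [10, 40, 25, 5]})

clean = customers.drop_duplicates().copy()                  # remove repeated rows
clean["monthly_bill"] = clean["monthly_bill"].fillna(clean["monthly_bill"].median())
clean["plan"] = clean["plan"].fillna("unknown")             # explicit missing category

# Integration: bring in a second source on the shared key
merged = clean.merge(usage, on="customer_id", how="left")

# Transformation: min-max scale a numeric column, one-hot encode a categorical one
bill = merged["monthly_bill"]
merged["bill_scaled"] = (bill - bill.min()) / (bill.max() - bill.min())
prepared = pd.get_dummies(merged, columns=["plan"])

print(prepared)
```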
Q2. Compare Data Mining and Data Warehousing.
Answer:
Aspect | Data Mining | Data Warehousing
Definition | Process of discovering hidden patterns | Repository to store integrated data
Purpose | Knowledge discovery, prediction | Querying, OLAP, reporting
Techniques | Classification, clustering, association | ETL, schema design (star/snowflake)
Example | Market basket analysis | Sales data warehouse
Diagram:
Data Sources → ETL → Data Warehouse → Data Mining → Knowledge
Q3. Write short notes on Structured, Semi-structured, and Unstructured data.
Answer:
Structured: Tabular, relational DB (e.g., bank transactions).
Semi-structured: XML, JSON logs.
Unstructured: Images, video, text (social media).
Q4. Discuss benefits and challenges of Data Science.
Answer:
Benefits:
o Better decision-making
o Cost reduction
o Personalized recommendations
o Fraud detection
Challenges:
o Data privacy
o Handling large unstructured data
o Model interpretability
Q5. Explain Big Data characteristics (5 Vs).
Answer:
1. Volume – Large size (TB, PB).
2. Velocity – High speed data streams.
3. Variety – Text, video, audio, logs.
4. Veracity – Data uncertainty/noise.
5. Value – Useful insights.
Diagram: 5V pyramid.
UNIT II – DESCRIBING DATA
Q6. Explain measures of central tendency and dispersion with example.
Answer:
Central Tendency:
o Mean = average
o Median = middle value
o Mode = most frequent
Dispersion:
o Range = Max – Min
o Variance ( \sigma^2 = \frac{\sum (x_i-\mu)^2}{n} )
o Std deviation ( \sigma = \sqrt{\sigma^2} )
o IQR = Q3 – Q1
Diagram: Boxplot showing quartiles.
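The variance formula above divides by n (population variance). Libraries differ in their defaults, which is a frequent source of mismatched answers; the sketch below, on hypothetical values, compares NumPy (divides by n by default) with pandas (divides by n - 1 by default).

```python
import numpy as np
import pandas as pd

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical values, mean = 5

pop_var = np.var(data)               # NumPy default ddof=0: divide by n
sample_var = np.var(data, ddof=1)    # divide by n - 1
pandas_var = pd.Series(data).var()   # pandas default ddof=1

print(f"population variance = {pop_var}")
print(f"sample variance     = {sample_var:.4f}")
print(f"pandas Series.var() = {pandas_var:.4f}")
```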
Q7. Explain Normal Distribution and Standardization.
Answer:
Normal Distribution: Bell-shaped, symmetric.
Empirical Rule: 68% (±1σ), 95% (±2σ), 99.7% (±3σ).
z-score:
[
z = \frac{x-\mu}{\sigma}
]
Helps compare across distributions.
Diagram: Bell curve with ±1σ, ±2σ, ±3σ.
Q8. Explain different data visualization techniques for describing data.
Answer:
Histogram → frequency distribution
Boxplot → spread, outliers
Bar Chart → categorical comparison
Scatter plot → relationships
UNIT III – DESCRIBING RELATIONSHIPS
Q9. Explain correlation with formula and interpretation.
Answer:
Correlation measures strength & direction of relationship between variables.
Pearson’s r:
[
r = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2 \sum(y_i-\bar{y})^2}}
]
Values:
o r = +1 → Perfect positive
o r = -1 → Perfect negative
o r = 0 → No correlation
Diagram: Scatter plots (positive, negative, none).
Q10. Derive the regression line equation.
Answer:
Simple linear regression:
[
y = a + bx
]
where,
[
b = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}
]
[
a = \bar{y} - b\bar{x}
]
Diagram: Line fitting scatter points.
Q11. Explain Multiple Linear Regression with example.
Answer:
[
y = a + b_1x_1 + b_2x_2 + ... + b_kx_k
]
Example: Predicting house price using (area, bedrooms, age).
Uses: Finance, healthcare, marketing.
Q12. Explain concept of causation vs correlation.
Answer:
Correlation ≠ causation.
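Example: Ice cream sales and drowning incidents → correlated but not causal; both rise in hot weather, so temperature is a confounding variable.
Establishing causation generally requires controlled experiments or careful control of confounders.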
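The ice cream example can be reproduced with simulated data. The sketch below assumes a toy model (an illustration, not real data) in which temperature drives both variables and neither affects the other; the raw correlation is strong, but it largely disappears once the effect of temperature is removed from both.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 5_000

# Assumed toy model: temperature drives both quantities
temperature = rng.normal(size=n)
ice_cream = 2.0 * temperature + rng.normal(size=n)
drownings = 1.5 * temperature + rng.normal(size=n)

raw_r = np.corrcoef(ice_cream, drownings)[0, 1]

def residual(y, x):
    """Part of y left over after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)

# Correlation after removing the effect of temperature from both variables
partial_r = np.corrcoef(residual(ice_cream, temperature),
                        residual(drownings, temperature))[0, 1]

print(f"raw correlation:             {raw_r:.2f}")
print(f"controlling for temperature: {partial_r:.2f}")
```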
UNIT IV – DATA WRANGLING USING PYTHON
Q13. Explain Pandas Series and DataFrame with examples.
Answer:
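import pandas as pd
# Series: one-dimensional labelled array
s = pd.Series([10, 20, 30])
print(s)
# DataFrame: two-dimensional table with named columns
df = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Marks': [85, 90, 95]})
print(df)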
Output: Series (1D), DataFrame (2D table).
Q14. Explain handling missing data in Pandas.
Answer:
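dropna() – remove rows (or columns) that contain missing values
fillna(value) – replace missing values with a given value, e.g. the mean or median
interpolate() – estimate missing values from neighbouring values (linear by default)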
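The methods listed above give different results on the same data, which is easiest to see side by side. The Series below is a hypothetical set of readings with two gaps.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])   # hypothetical readings

dropped = s.dropna()                 # keeps only the observed values
filled_mean = s.fillna(s.mean())     # mean() skips NaN by default
forward = s.ffill()                  # carry the last observed value forward
interpolated = s.interpolate()       # linear interpolation between neighbours

print(pd.DataFrame({
    "original": s,
    "fillna(mean)": filled_mean,
    "ffill": forward,
    "interpolate": interpolated,
}))
```

Which method is appropriate depends on the data: forward fill and interpolation assume the rows are ordered (for example, a time series), while mean imputation does not.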
Q15. Explain GroupBy and Aggregation with example.
Answer:
import pandas as pd
df = pd.DataFrame({'Dept': ['CSE', 'CSE', 'ECE'], 'Marks': [90, 85, 70]})
# Mean marks per department
print(df.groupby('Dept')['Marks'].mean())
Output: Dept-wise average.
UNIT V – DATA VISUALIZATION
Q16. Explain Histogram, Boxplot, Scatter plot with code.
Answer:
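import matplotlib.pyplot as plt
import seaborn as sns
marks = [55, 65, 70, 75, 80, 90, 95, 100]
# Histogram: frequency of marks in 5 bins
plt.hist(marks, bins=5)
plt.title("Histogram")
plt.show()
# Boxplot: median, quartiles and outliers
sns.boxplot(x=marks)
plt.title("Boxplot")
plt.show()
# Scatter plot: marks against position in the list (illustrative x-axis)
plt.scatter(range(len(marks)), marks)
plt.title("Scatter Plot")
plt.show()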
Q17. Write Python code for line plot and bar plot.
Answer:
import matplotlib.pyplot as plt
x, y = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
# Line plot
plt.plot(x, y, marker='o')
plt.title("Line Plot")
plt.show()
# Bar chart
plt.bar(['A', 'B', 'C'], [10, 20, 15])
plt.title("Bar Chart")
plt.show()
Q18. Explain Contour and 3D Surface Plots with example.
Answer:
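import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # only needed on older Matplotlib versions
X, Y = np.meshgrid(np.linspace(-3, 3, 100), np.linspace(-3, 3, 100))
Z = X**2 + Y**2
# Contour plot of Z = X^2 + Y^2
plt.contour(X, Y, Z)
plt.title("Contour")
plt.show()
# 3D surface plot of the same function
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis')
plt.show()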