Introduction To Data Science
5th Semester
University Of Science & Technology Bannu
Prepared By:
CodeZdaka Team (USTB)
Compiled by…
Date: Nov-2025
Types:
Can be divided into three categories:
Database:
A database is an organized collection of related data that can be easily stored, managed, and retrieved when needed.
OR
A collection of logically interrelated data that can be used or shared by many users according to their needs.
A database is managed using software called a DBMS (Database Management System).
There are two main kinds of databases:
Relational Database – stores data in connected tables (e.g., MySQL, PostgreSQL).
Non-Relational (NoSQL) Database – stores data in documents or key-value form (e.g., MongoDB).
Data Warehouse:
A data warehouse is a large central system that stores historical data from many different sources for analysis,
reporting, and decision-making.
Unlike a normal database (used for daily work), a data warehouse is used to study data, find patterns, and make
informed decisions.
Aspect | Database | Data Warehouse
Purpose | Used for daily operations and transactions | Used for data analysis, reporting, and decision-making
Data Type | Stores current (real-time) data | Stores historical (past) data
Data Source | Usually one main source | Combines data from many sources
Data Usage | For adding, updating, and deleting data | For analyzing, summarizing, and reporting data
Users | Used by operational staff and developers | Used by analysts and managers
Data Structure | Highly normalized (many small linked tables) | Denormalized (combined tables for fast analysis)
Data science:
Data science is the interdisciplinary field that uses scientific methods, mathematics, and computer tools to collect,
clean, analyze, and understand data to make better decisions.
Interdisciplinary means data science uses ideas from many subjects together.
1. Mathematics
2. Statistics
3. Computer Science (programming):
Python
Algorithms
Data structures
Libraries (NumPy, pandas, scikit-learn, Matplotlib, seaborn)
4. Domain Knowledge
Data science is all about turning raw data into useful knowledge.
A data scientist collects the data, cleans it (removes errors), analyzes it, and explains what it means in a way that
helps others make decisions.
Data Science Hype:
Data Science Hype means the period when data science became extremely popular, and people started thinking it
could solve almost every problem.
Companies, media, and universities promoted data science as a magical field, creating very high expectations.
2. Data Scientists work with this data to help companies understand it and use it to make proper decisions.
3. Companies use a data-driven approach, where large amounts of data are analyzed to find useful insights.
These insights help companies know how they are performing in the market and how to improve.
4. Data Science is not only used in business. Healthcare industries also use it to detect problems like tumors
or deformities at an early stage, which helps in faster diagnosis and better treatment.
Big Data:
Big Data means extremely large datasets that are too big, too fast, or too complex for traditional data processing
tools to handle. It is a core part of data science because modern systems generate massive amounts of data every
second.
• Velocity: Data is generated very fast and must be processed quickly. Example: Stock market updates every
millisecond, live sensor data from cars and IoT devices.
• Variety: Data comes in many forms – structured (tables, numbers), semi-structured (JSON, logs), and
unstructured (images, videos, text).
• Veracity: Data may contain errors or noise. Big Data systems must clean and filter it.
• Value: The goal is to extract useful insights. Example: Amazon recommends products, hospitals predict
disease risks.
• Big data technologies, such as Hadoop and Spark, provide the tools and techniques needed to store, process,
and analyze large and complex data sets.
• Machine learning algorithms and other data science techniques are also used to extract insights from big
data.
• The insights obtained from big data can help businesses and organizations make informed decisions,
optimize their operations, and improve their products and services.
Datafication:
• Datafication means turning different parts of the world—people, places, things, and activities—into digital
data.
• This process includes collecting, storing, and analyzing large amounts of data so it can be used to gain
insights, make decisions, and create value.
• The rise of digital technologies and more connectivity between people, devices, and systems has made
datafication grow rapidly.
Examples:
IoT devices and sensors collect huge amounts of data about the physical world.
Social media and e-commerce platforms collect data about consumer behavior and preferences.
Datafication is important because it provides the raw data that can be analyzed to make smarter decisions.
Current landscape of perspectives in Data Science means the overall modern view of Data Science, seen from
different important angles.
Data Visualization:
Data visualization is an essential perspective because it allows analysts and stakeholders to explore and
understand data easily. Tools like Tableau and Power BI help present data in a clear, visual format.
New techniques, such as augmented reality (AR) and virtual reality (VR), are being developed to make
data exploration more interactive and engaging, helping people see patterns and trends more clearly.
Interdisciplinary Collaboration:
Data Science is not just about coding or statistics; it requires collaboration between people from different
backgrounds. This includes statisticians, computer scientists, domain experts, and business analysts.
Working together ensures that data is analyzed correctly, interpreted accurately, and the insights derived are
applied effectively to solve real problems.
Statistical Analysis:
A strong foundation in statistics is very important.
This includes knowledge of probability, regression analysis, hypothesis testing, and experimental
design.
Statistics helps data scientists understand patterns in data, make predictions, and validate their
findings.
Machine Learning:
Machine learning is a critical skill because it enables data scientists to build predictive models and
intelligent algorithms.
They should know different types of machine learning algorithms, including supervised learning (where
data has labels) and unsupervised learning (where data has no labels).
Machine learning helps in solving real-world problems like predicting customer behavior, detecting
fraud, or recognizing images and speech.
Data Visualization:
Visualizing data is essential for communicating insights clearly to stakeholders.
Data scientists should know how to use tools like Tableau, Power BI, or Matplotlib to create charts,
graphs, and dashboards.
Visualization helps make complex data easier to understand and supports data-driven decision making.
Communication Skills:
Data scientists need to explain complex technical concepts in simple terms to non-technical stakeholders.
Excellent written and verbal communication is necessary to present insights clearly and effectively.
Communication helps ensure that data insights are understood and used in decision-making.
Creativity:
Creativity is important because data scientists often need to find innovative solutions to complex
problems.
They should be able to look at problems from different angles and think outside the box.
Creative thinking helps in discovering hidden patterns in data and developing new approaches for
analysis.
CHAPTER 2 (EDA)
Discrete Data:
Think of discrete data as things we can count one by one. They are separate numbers, usually whole
numbers, and we cannot split them into smaller parts. They are always in integer form.
o Examples:
Number of students in a class (we cannot say 10.5 students)
Number of cars in a parking lot
Goals scored in a football match
o How we show it: Bar charts, histograms, tables with counts
Continuous Data:
Continuous data is like measuring something, not counting it. It can take any value, even fractions or
decimals, and can be very precise.
o Examples:
Temperature (36.6°C)
Weight (70.5 kg)
Time to finish a race
o How we show it: Line plots, scatter plots, box plots
Interval Data
o Numbers have equal steps between them, but zero does not mean nothing.
o Example: Temperature in Celsius, calendar dates, time of day
o We can add or subtract, but cannot multiply or divide meaningfully
Ratio Data
o Numbers have a true zero, which means nothing exists at zero.
o Example: Weight, height, distance, elapsed time (duration)
o We can add, subtract, multiply, divide
Nominal
o These are categories without any order.
o We cannot say one category is higher or lower than another.
o Examples: Colors (red, blue, green), Gender (male, female), Blood type (A, B, O, AB).
o Use for counting how many items are in each category or for pie charts and bar graphs.
Ordinal
o These are categories with a meaningful order, but the difference between them is not exact.
o Example: Customer satisfaction (poor, average, good, excellent), Education level (high school,
bachelor, master, PhD).
o We know the order matters, but we cannot measure how much bigger or better one category is
compared to another.
[Figure: Types of Data. Quantitative Data: Interval Scale, Ratio Scale. Qualitative Data: Nominal, Ordinal.]
Exploratory Data Analysis (EDA):
Exploratory Data Analysis (EDA) is the process where we study and explore data to understand patterns,
mistakes, missing values, and relationships before doing any model or deep analysis.
We do EDA to make sure the data is clean, correct, and ready for machine learning.
We don’t start with complex math in EDA.
We just look, explore, and understand what is going on inside the data.
Basic tools and techniques in EDA are the simple methods we use to look at data, clean data, and
understand data before doing advanced analysis. Some of the most useful tools and techniques are the following:
Descriptive statistics.
Data visualization.
Histogram:
is a graph that shows how a numeric column is spread.
Box plot:
Scatter plot:
Bar chart:
is used for category data.
Heatmap:
shows values using colors.
Cross-tabulation:
is used for two category columns.
• We make a table that shows how often each pair of categories appears.
• This helps us see the relationship between two categorical variables.
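The cross-tabulation idea above can be sketched in a few lines of plain Python. This is only an illustration: the survey records and the column names (gender, preferred drink) are made-up assumptions, not data from these notes.

```python
from collections import Counter

# Hypothetical survey records: (gender, preferred_drink) pairs.
records = [
    ("male", "tea"), ("female", "coffee"), ("female", "tea"),
    ("male", "coffee"), ("female", "coffee"), ("male", "tea"),
]

def cross_tab(pairs):
    """Count how often each (row_category, column_category) pair appears."""
    return Counter(pairs)

table = cross_tab(records)
rows = sorted({r for r, _ in records})
cols = sorted({c for _, c in records})

# Print the contingency table, one row category per line.
print("".ljust(8) + "".join(c.ljust(8) for c in cols))
for r in rows:
    print(r.ljust(8) + "".join(str(table[(r, c)]).ljust(8) for c in cols))
```

Each cell of the printed table is the number of records that have that pair of categories.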
Outlier detection:
• It is the process of finding values that are far away from the rest.
• These values can affect our analysis.
• We can find outliers using z-score, Tukey’s method, or simply by looking at a box plot.
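As a rough sketch, the z-score rule and Tukey's method could look like the code below. The sample data, the z-score threshold of 2.0, and the fence multiplier k = 1.5 are assumptions chosen for illustration; in very small samples a z-score can never get very large, so the usual threshold of 3 would flag nothing here.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return values whose absolute z-score is above the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []
    return [v for v in values if abs((v - mean) / std) > threshold]

def tukey_outliers(values, k=1.5):
    """Return values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one value far away from the rest.
data = [10, 12, 11, 13, 12, 11, 10, 12, 95]

print(zscore_outliers(data))
print(tukey_outliers(data))
```

Both functions flag only the value 95 for this sample. Tukey's method is the same rule a box plot uses to draw its whiskers.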
Data transformation.
Philosophy of EDA:
Philosophy of EDA means the basic way of thinking while exploring data.
It says that we first understand the data through simple checks and visuals before doing any big model.
When we do EDA, we try to “see” the data clearly.
We use simple plots.
We stay curious.
We move step by step.
We let the data guide us.
The core principles of EDA:
Iterative process:
• EDA is not done only once.
• We explore, then we think, then we explore more.
• New questions come.
• We make new plots.
• Slowly, we understand the data deeper.
Holistic understanding:
Learning is how humans, animals, and some machines gain knowledge or skills. It happens when we study,
practice, observe, or are taught.
According to Herbert Simon, learning is any process by which a system improves performance from
experience.
According to Tom Mitchell (1998), ML studies algorithms that improve performance (P) at a task (T)
with experience (E). A learning task can be defined as <P, T, E>.
Instead of fixed rules, ML algorithms analyze data, find patterns, and adapt to changes.
ML is used in many areas like recommendation systems, self-driving cars, voice assistants, and spam
detection.
ML is making machines smarter by learning from data and experience, not just following instructions.
ML is used when:
Human expertise does not exist (e.g., navigation on Mars)
Humans cannot explain their expertise (e.g., speech recognition)
Models must be customized for specific tasks
Models are based on huge amounts of data
We use ML when problems are too complex, data is large, or human knowledge is limited.
Machine Learning:
o We provide data + output → algorithm learns rules → predicts new output.
o Machine finds patterns and improves from experience.
o Works well for complex problems where rules are hard to write.
o Example: Email spam filter learns from examples of spam emails.
show ads to the right people
suggest products people may like
So, commercial motivation means using machine learning to:
keep customers for a long time
increase profit
reduce cost
improve business decisions
Unsupervised learning:
Reinforcement learning:
Reinforcement Learning
Examples:
Classification:
Classification is used when the output is a class. A class is a group or label. Each record belongs to
one class only.
Classification helps us group data. It helps us find behavior patterns.
Example:
We look at:
training set
test set
Training set:
The training set is used to build the model. Each record has attributes. One attribute is the
class label.
The model learns how attributes relate to the class.
Test set:
Unsupervised Learning:
Clustering:
Clustering means making groups. We give the model many data points. Each point has some
features.
The model checks similarity. Similar points go into the same group. Different points go into
different groups.
Inside one group, points are close. Between groups, points are far. That is the main idea.
Simple and useful.
To measure closeness, we often use distance. For numeric data, we commonly use Euclidean distance,
which is the ordinary straight-line distance between two points.
We commonly use:
K-Means
Hierarchical Clustering
DBSCAN
A very common use is documents. We give many documents to the model. The model looks
at important words. Documents with similar words go together.
Search systems use this. When we search, similar documents appear. Google Scholar is one
example.
Association:
This also comes under unsupervised learning. Here we do not make groups. Instead, we find
relationships.
For example, customers buy items. Some items are often bought together. This is very useful for
stores.
Stores place these items close. Sales increase. Smart use of data.
We commonly use:
Apriori
FP-Growth
Anomaly Detection:
Anomaly detection is about finding something unusual. The data has no labels.
So we do not know what is normal or abnormal at first.
The model looks at the data. It learns what is normal behavior. Anything very different is marked
as an anomaly. An anomaly is also called an outlier.
What is an anomaly:
It happens rarely.
But it is very important.
We give the model normal data. The model learns the normal pattern. Then new data comes.
If it fits the pattern, it is normal. If it is far from the pattern, it is an anomaly.
Distance, density, or pattern difference is used. No class labels are needed.
Algorithms used for anomaly detection:
Isolation Forest
One-Class SVM
Local Outlier Factor (LOF)
Autoencoders
Reinforcement Learning:
Reinforcement learning is different. Here an agent learns by trying. No teacher gives correct answers.
The agent takes an action. Then it gets a reward or punishment. The reward may come later.
So learning takes time.
model-free
model-based
Q-Learning
SARSA
Deep Q-Network (DQN)
Types:
Model:
Y = a + bX
Where:
Y = dependent variable
X = independent variable
a = intercept
b = slope
Slope:
$$ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} $$
Intercept:
$$ a = \bar{y} - b\bar{x} $$
$$ a = \frac{\sum y - b\sum x}{n} $$
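The slope and intercept formulas can be checked with a small sketch. The function name fit_line and the sample points are assumptions made for illustration; the points were chosen to lie exactly on Y = 1 + 2X so the result is easy to verify by hand.

```python
def fit_line(xs, ys):
    """Return (a, b) for the line Y = a + bX using the summation formulas."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Slope: b = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: a = (sum(y) - b*sum(x)) / n
    a = (sum_y - b * sum_x) / n
    return a, b

# Hypothetical data lying exactly on Y = 1 + 2X.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

a, b = fit_line(xs, ys)
print("intercept a =", a, "slope b =", b)

# Predict Y for a new X with the fitted line.
prediction = a + b * 6
print("prediction for X = 6:", prediction)
```

For these points the sums are n = 5, ∑x = 15, ∑y = 35, ∑xy = 125, and ∑x² = 55, which gives b = 100 / 50 = 2 and a = (35 − 30) / 5 = 1.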
$$ Y = a + b_1 X_1 + b_2 X_2 + \dots + b_n X_n $$
K Nearest Neighbor:
K Nearest Neighbor is a supervised machine-learning algorithm.
It is used for:
Classification problems
Regression problems
KNN Steps:
1. Choose a number K
2. Find the K nearest points to the new point
3. Check their classes
4. The majority class becomes the answer
What is K?
K is the number of nearest neighbors we check. Usually K is chosen as an odd number to avoid ties in the vote.
Example:
Formula:
For N-dimensional:
$$ \text{Distance} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} $$
Where “p” holds the features of one point and “q” holds the features of the second point.
Important Note:
1. For classification, we use majority voting: the new point is assigned to the class that most of its K neighbors belong to.
2. For regression, instead of majority voting, we take the average of the neighbors' values.
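The KNN steps, the Euclidean distance formula, and the classification/regression note can be put together in a minimal sketch. The function names, the training points, and the choice K = 3 are assumptions for illustration only.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance between two points with the same number of features."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, new_point, k=3, task="classification"):
    """train is a list of (features, target) pairs."""
    # Steps 1-2: pick the K training points closest to the new point.
    neighbors = sorted(train, key=lambda item: euclidean(item[0], new_point))[:k]
    targets = [target for _, target in neighbors]
    if task == "classification":
        # Steps 3-4: the majority class among the neighbors is the answer.
        return Counter(targets).most_common(1)[0][0]
    # Regression: average of the neighbors' values.
    return sum(targets) / len(targets)

# Hypothetical labelled points in two dimensions.
labelled = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 7.0), "B"), ((6.5, 8.0), "B"),
]
predicted_class = knn_predict(labelled, (2.0, 2.0), k=3)

# Hypothetical one-feature data with numeric targets.
numeric = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
predicted_value = knn_predict(numeric, (2.0,), k=3, task="regression")

print(predicted_class, predicted_value)
```

The new point (2.0, 2.0) has three class "A" points as its nearest neighbors, so it is assigned to class "A". In the regression case the three nearest targets are 10, 20, and 30, so the prediction is their average.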
Naive Bayes:
Naive Bayes is a supervised machine-learning algorithm.
Text classification
Spam detection
Sentiment analysis
$$ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} $$
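Where:
P(A|B) = probability of A given that B has occurred (posterior)
P(B|A) = probability of B given that A has occurred (likelihood)
P(A) = prior probability of A
P(B) = probability of B (evidence)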
It assumes that all features are independent. That means Feature 1 does not affect Feature 2. This
assumption is not always true in real life. That is why it is called naive.
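A small numeric sketch may help show how Bayes' theorem and the independence assumption work together. All probabilities below are invented for illustration (a hypothetical spam filter), and the function name is an assumption. Because the features are treated as independent, their likelihoods are simply multiplied.

```python
import math

def naive_bayes_posterior(prior_a, likelihoods_given_a, likelihoods_given_not_a):
    """Return P(A | features) for a two-class problem (A versus not A).

    Naive assumption: features are independent given the class,
    so the per-feature likelihoods are multiplied together.
    """
    score_a = prior_a * math.prod(likelihoods_given_a)
    score_not_a = (1.0 - prior_a) * math.prod(likelihoods_given_not_a)
    evidence = score_a + score_not_a  # P(B), the denominator of Bayes' theorem
    return score_a / evidence

# Hypothetical numbers: 20% of emails are spam.
# The word "free" appears in 60% of spam and 5% of normal emails.
one_word = naive_bayes_posterior(0.2, [0.6], [0.05])

# A second word, "winner", appears in 50% of spam and 10% of normal emails.
two_words = naive_bayes_posterior(0.2, [0.6, 0.5], [0.05, 0.1])

print(round(one_word, 4), round(two_words, 4))
```

With one word the posterior is 0.12 / (0.12 + 0.04), about 0.75. With two words it is 0.06 / (0.06 + 0.004), about 0.94, so each extra piece of evidence moves the probability further.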
Step-by-Step:
Suppose we choose K = 2.
2. Step 2: Calculate distance of each data point from both centroids. Assign each point to nearest
centroid.
3. Step 3: Now calculate new centroid by taking mean of all points in that cluster.
Centroid formula:
For cluster with n points:
$$ \text{Centroid}_x = \frac{x_1 + x_2 + \dots + x_n}{n} $$
$$ \text{Centroid}_y = \frac{y_1 + y_2 + \dots + y_n}{n} $$
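The assignment step and the centroid update can be sketched as a short loop. This is a simplified illustration for two-dimensional points with K = 2; the sample points, the starting centroids, and the function names are assumptions, and a real implementation would usually pick the starting centroids at random.

```python
import math

def assign(points, centroids):
    """Step 2: put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        clusters[distances.index(min(distances))].append(p)
    return clusters

def update(clusters, old_centroids):
    """Step 3: new centroid = mean of x values and mean of y values in the cluster."""
    new_centroids = []
    for cluster, old in zip(clusters, old_centroids):
        if not cluster:
            new_centroids.append(old)  # keep the old centroid for an empty cluster
            continue
        n = len(cluster)
        new_centroids.append((sum(x for x, _ in cluster) / n,
                              sum(y for _, y in cluster) / n))
    return new_centroids

def k_means(points, centroids, max_iter=100):
    """Repeat steps 2 and 3 until the centroids stop changing."""
    clusters = assign(points, centroids)
    for _ in range(max_iter):
        new_centroids = update(clusters, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
        clusters = assign(points, centroids)
    return centroids, clusters

# Hypothetical points forming two obvious groups, with K = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, [(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

The first cluster ends up with the three points near (1, 1) and the second with the three points near (8, 8); each final centroid is the mean of its three points.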
Advantages:
Simple
Fast
Easy to implement
Disadvantages:
Need to choose K
Sensitive to outliers
Works best with round shaped clusters
CHAPTER 4 (FEATURE ENG…)
Feature Engineering:
Feature engineering means transforming raw data into better features so that a machine learning model can
learn better. It makes data more useful.
Feature engineering is important because model performance often depends more on the features than on the algorithm.
Even a simple model can work very well if the features are good.
Techniques:
Handling Missing Values:
If some values are empty:
Example:
Label Encoding
Red = 0
Blue = 1
Green = 2
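Label encoding as shown above can be sketched with a plain dictionary. In this illustration, codes are assigned in order of first appearance so that the result matches the Red = 0, Blue = 1, Green = 2 example; the function name and the sample column are assumptions.

```python
def label_encode(values):
    """Replace each category with an integer code (in order of first appearance)."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return [mapping[value] for value in values], mapping

# Hypothetical color column.
colors = ["Red", "Blue", "Green", "Blue", "Red"]
codes, mapping = label_encode(colors)

print(mapping)
print(codes)
```

The integer codes imply an order (0 < 1 < 2) that the colors do not really have, which is one reason label encoding is usually a better fit for ordinal categories.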
Feature Scaling:
Some models, like KNN, K-Means, and Logistic Regression, work better with scaling. If one feature has
very large values, it dominates the others.
Example: Salary = 100000 and Age = 25, so Salary is very large compared to Age.
Min Max Scaling:
$$ X_{new} = \frac{X - X_{min}}{X_{max} - X_{min}} $$
Standardization:
$$ Z = \frac{X - \text{Mean}}{\text{Standard Deviation}} $$
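Both scaling formulas can be tried on small lists. The salary and age values below are made up, and the sketch uses the population standard deviation, which is an assumption (some tools use the sample standard deviation instead).

```python
import statistics

def min_max_scale(values):
    """Min-max scaling: (X - Xmin) / (Xmax - Xmin), giving values in [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # all values equal, nothing to spread out
    return [(v - low) / (high - low) for v in values]

def standardize(values):
    """Standardization: (X - mean) / standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical features on very different scales.
salaries = [20000, 40000, 60000, 100000]
ages = [20, 25, 30, 35, 40]

scaled_salaries = min_max_scale(salaries)
standardized_ages = standardize(ages)

print(scaled_salaries)
print([round(z, 3) for z in standardized_ages])
```

After min-max scaling the salaries lie between 0 and 1. After standardization the ages have mean 0 and standard deviation 1, so neither feature dominates a distance calculation just because of its units.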
Handling Outliers:
Outliers can distort the model.
We can:
Remove them
Cap them
Transform them
Transformation of Features:
If data is skewed, we apply:
Log transformation
Square root transformation
Example:
Example:
From 24-02-2026
Extract month = 2
Day = 24
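Extracting date parts like this is straightforward with Python's standard datetime module. The sketch below assumes the day-month-year format used in the example; the extra weekday and weekend features are additions for illustration.

```python
from datetime import datetime

def date_features(date_text, fmt="%d-%m-%Y"):
    """Split a date string into separate numeric features."""
    d = datetime.strptime(date_text, fmt)
    return {
        "day": d.day,
        "month": d.month,
        "year": d.year,
        "weekday": d.weekday(),         # Monday = 0 ... Sunday = 6
        "is_weekend": d.weekday() >= 5,
    }

features = date_features("24-02-2026")
print(features)
```

Each key in the returned dictionary can become its own column, for example to let a model learn monthly or weekend patterns.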
Examples:
Math Combination:
We can create:
$$ BMI = \frac{Weight}{Height^2} $$
BMI is a new feature.
Interaction Features:
Date Features:
Polynomial Features:
If we have: X
Text Features:
Feature Selection:
Feature selection means choosing only the important features and removing useless or noisy features. In feature
selection, we reduce the number of columns.
Feature Selection is important because, too many features cause:
Overfitting
slow training
High computation
Curse of dimensionality
Types of Feature Selection:
Filter methods select features before training the model. They use statistical tests. They do
not depend on any machine learning model.
Used for numerical data. It measures the linear relationship between a feature and the target.
$$ r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \times \sigma_Y} $$
Where:
Cov = covariance
σ = standard deviation
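If:
r is close to +1 → strong positive linear relationship
r is close to −1 → strong negative linear relationship
r is close to 0 → little or no linear relationship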
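The correlation formula can be computed directly from its definition. The sketch below uses the population covariance and standard deviation, and the two small datasets are invented so that one gives a perfect positive correlation and the other gives zero.

```python
import math

def pearson_r(xs, ys):
    """r = Cov(X, Y) / (std(X) * std(Y)), using population formulas."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (std_x * std_y)

# Hypothetical feature and target values.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 6, 8, 10]     # rises exactly with hours
unrelated = [5, 3, 4, 3, 5]   # no linear trend with hours

r_strong = pearson_r(hours, scores)
r_none = pearson_r(hours, unrelated)
print(round(r_strong, 3), round(r_none, 3))
```

In a filter method, features whose absolute r with the target is high would be kept, and features with r near 0 would be candidates for removal.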
Used for categorical features. It checks dependence between feature and target.
$$ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} $$
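As an illustration, the chi-square statistic can be computed from a table of observed counts. The expected count of each cell is taken as (row total × column total) / grand total, which is the usual choice when testing independence; the counts below are invented, and the sketch assumes no row or column total is zero.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the feature and the target were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (obs - expected) ** 2 / expected
    return statistic

# Hypothetical counts: rows = feature category, columns = target class.
dependent = [[30, 10], [10, 30]]
independent = [[20, 20], [20, 20]]

chi_dependent = chi_square(dependent)
chi_independent = chi_square(independent)
print(chi_dependent, chi_independent)
```

In the first table every expected count is 20 and every cell differs from it by 10, so the statistic is 4 × (100 / 20) = 20. In the second table observed and expected counts are equal, so the statistic is 0 and the feature tells us nothing about the target.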
Used when the feature is numerical and the target is categorical. It compares the mean values of
different groups.
If group means are very different → feature is useful
It calculates F score:
Advantages of filter methods: fast, simple, and they work well for high-dimensional data.
Disadvantages: they ignore interactions between features and may select redundant features.
1. Training a model
2. Ranking features by importance
3. Removing the least important feature
4. Repeating the process
It continues until the required number of features remains.
That is why name is:
Embedded Methods:
Feature selection happens during model training. Selection is built into the algorithm.
$$ \text{Loss} = \sum (y_i - y_{predicted})^2 $$
$$ \text{Loss} = \sum (y_i - y_{predicted})^2 + \lambda \sum |\beta| $$
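The two loss formulas differ only in the penalty term, which is the L1 penalty commonly associated with Lasso regression. The sketch below computes both for made-up predictions and coefficients; the value λ = 0.1 is an arbitrary assumption.

```python
def squared_error_loss(y_true, y_pred):
    """Loss = sum of (y_i - y_predicted)^2."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred))

def l1_penalized_loss(y_true, y_pred, coefficients, lam):
    """Loss = sum of squared errors + lambda * sum of |beta|."""
    penalty = lam * sum(abs(beta) for beta in coefficients)
    return squared_error_loss(y_true, y_pred) + penalty

# Hypothetical targets, predictions, and model coefficients.
y_true = [3, 5, 7]
y_pred = [2, 5, 9]
coefficients = [2.0, -0.5, 0.0]

plain = squared_error_loss(y_true, y_pred)
penalized = l1_penalized_loss(y_true, y_pred, coefficients, lam=0.1)
print(plain, penalized)
```

Here the squared errors are 1, 0, and 4, so the plain loss is 5, and the penalty adds 0.1 × 2.5 = 0.25. Because every non-zero coefficient increases the loss, training tends to push unimportant coefficients to exactly zero, and that is how feature selection happens inside the model.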
A decision tree selects features based on information gain or the Gini index. The feature with the
highest information gain is selected first.
Important features appear at the top of the tree. Unimportant features may not appear at all.
Advantages of embedded methods: faster than wrapper methods, usually more accurate than filter
methods, and they consider feature interaction.
Disadvantages: they depend on a specific model and may not generalize to other models.
CHAPTER 4 (DIMENSIONALITY RED…)
Dimensionality Reduction:
Dimension means the number of features. If a dataset has Age, Income, Height, and Weight, then dimension = 4.
If a dataset has 100 features, dimension = 100.
Dimensionality reduction means reducing the number of features while keeping the most important information.
We need it when there are too many features: the model becomes slow, overfitting increases, storage increases,
visualization becomes difficult, and the curse of dimensionality appears, especially in text data, image data,
and genomics.
1. Feature Selection
2. Feature Extraction
Feature Extraction:
Feature extraction means transforming the original features into a new, smaller set of features that still
contains most of the important information.
The old features are replaced, and the new features are combinations of the old features, so the dimension decreases.
In feature selection, we delete some features. In feature extraction, we combine the information of
many features into fewer new features, so the information loss is usually smaller.
This transforms the original features into new features called principal components. These
components are linear combinations of the original features, are uncorrelated, and capture
maximum variance.
The main objective is to maximize variance. It is used when unsupervised dimensionality
reduction is needed.
Maximize the ratio of between-class variance to within-class variance. Used mainly for
classification problems.
Maximum components: number of classes − 1.
Independent Component Analysis (ICA):
Example:
Autoencoders:
Structure:
Middle layer gives reduced features. Can handle nonlinear relationships. Used in deep
learning.
Factor Analysis:
Similar to PCA. Assumes data is influenced by hidden factors. Used in psychology and social
sciences.
Model: X = LF + error
Where:
L = loading matrix
F = hidden factors
Kernel PCA:
Extension of PCA.
Example:
Components:
Types:
Undirected Graph:
Example:
Facebook friendship.
Directed Graph:
In this graph, the relationship has a direction. If A follows B, it does not mean B follows A.
So we draw an arrow from A to B.
Example: Twitter.
Weighted Social Network Graph:
For example:
Number of messages
Number of likes
Frequency of interaction
Clustering of Graph:
Clustering of a graph means dividing the graph into groups of nodes such that nodes inside the same group
are strongly connected and nodes in different groups are weakly connected.
Example:
Example:
K way partitioning
Two approaches:
Nodes with many connections form a cluster. Sparse areas are separated.
4. Spectral Clustering:
Community:
A community is a group of nodes that have many connections inside the group and few connections
with nodes outside the group.
In social network:
1. Girvan–Newman Method:
This method removes edges step by step. It calculates edge betweenness, which
measures how many shortest paths pass through an edge. Edges that connect two
communities usually have high betweenness, so the algorithm removes the edges with the
highest betweenness. As these edges are removed, the graph splits into communities.
2. Louvain Method:
Modularity:
Modularity measures the difference between the actual number of edges inside communities and the
number of edges expected in a random graph.
High modularity means strong community structure. The value of modularity lies between −0.5 and 1.
A higher value means better clustering.
Important:
It helps in:
Neighborhood in a Graph:
The neighborhood of a node means all nodes that are directly connected to that node by an edge.
If a node is a person in a social network, its neighborhood is all the friends of that person.
Neighborhood Properties:
Neighborhood properties describe how a node is connected to its neighbors and how the neighbors are
connected among themselves.
Example:
2. Clustering Coefficient:
Clustering coefficient measures how connected the neighbors of a node are among themselves.
Formula:
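One commonly used definition of the local clustering coefficient for an undirected graph is C = 2e / (k(k − 1)), where k is the number of neighbors of the node and e is the number of edges that exist between those neighbors. The sketch below applies this definition to a small made-up friendship graph stored as a dictionary of neighbor sets; the graph and the function name are assumptions for illustration.

```python
def clustering_coefficient(graph, node):
    """Local clustering coefficient C = 2e / (k(k - 1)) for an undirected graph."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pair of neighbors to connect
    # e = number of edges between the neighbors (each pair counted once).
    e = sum(1 for u in neighbors for v in neighbors if u < v and v in graph[u])
    return 2 * e / (k * (k - 1))

# Hypothetical friendship graph: each key maps to its neighborhood.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

for node in sorted(graph):
    print(node, sorted(graph[node]), round(clustering_coefficient(graph, node), 3))
```

Node A has three neighbors (B, C, D) and only one edge among them (B and C), so C = 2 × 1 / (3 × 2) = 1/3. Node B's two neighbors, A and C, are connected to each other, so its coefficient is 1.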
3. Ego Network:
Ego network of a node includes:
4. Neighborhood Overlap:
Formula:
Data Visualization:
Data visualization means representing data in a visual form such as charts, graphs, or plots. It is used to
make data easier to understand, to identify patterns, trends, and relationships, and to communicate insights
effectively.
It is the process of showing data visually so that humans can easily understand it.
Important:
Data visualization is important because:
Types:
1. Charts and Graphs:
2. Plots:
3. Advanced Visualizations:
Clarity:
Simplicity:
Keep the design simple and easy to read.
Do not overload the chart with too much information. Highlight only the important points.
Accuracy:
The chart or graph must represent data correctly. Do not distort scales or use misleading visuals.
Example: A truncated Y-axis can exaggerate differences, which is not accurate.
Consistency:
Use consistent colors, scales, and labels across all visualizations.
This helps the audience compare different charts easily.
Proper Labeling:
Always include:
Axis labels
Titles
Legends (if necessary)
Units of measurement
Scales
Proportions
Colors
To convert raw data into a visual format such as charts, graphs, or maps.
To identify patterns, trends, and relationships that are not easy to see in tables.
To communicate insights effectively to decision-makers or audiences.
To make large datasets understandable and actionable.
Data visualization helps humans understand data quickly and make informed decisions.
Example:
A sales report with 1,000 rows of data is hard to read.
A line chart showing monthly sales trend makes it easy to understand.
1. Programming Tools:
Python Libraries:
o Matplotlib: Basic plotting library for line, bar, scatter, and pie charts.
o Seaborn: Built on Matplotlib; easier and visually better for statistical charts.
o Plotly: Interactive and dynamic charts for dashboards.
o Bokeh: Interactive web visualizations.
R Libraries:
o ggplot2: Powerful for creating statistical graphics.
o Lattice: Multivariate data visualization.
2. Software Tools:
Example:
Not using personal data without permission.
Avoiding biased models that discriminate against certain groups.
Important:
Ethics is important because:
Principles of Ethics:
1. Privacy:
2. Transparency:
3. Fairness:
Ensure models do not discriminate based on gender, race, religion, or other sensitive
attributes.
Check for biased data and correct it.
4. Accountability:
Data scientists should take responsibility for the results of their models and analysis.
Errors or harm caused by models must be addressed.
5. Security:
6. Integrity:
The main ethical issues are privacy, security, fairness, and responsibility.
Privacy:
Example:
A company tracking customer location without permission violates privacy.
Security:
Example:
A cloud database with unprotected user data can be stolen, leading to harm.
Ethics:
Ethics in data science ensures responsible, fair, and honest use of data.
Avoid biased models that discriminate based on race, gender, religion, or age.
Present findings accurately without manipulation.
Be accountable for decisions made using data.
Example:
A hiring algorithm rejecting qualified candidates due to biased training data is unethical.
Future data scientists should combine technical skills with ethical responsibility to ensure data benefits society
without causing harm.