
Introduction To Data Science

The document serves as an introduction to data science, covering key concepts such as data types, databases, data warehouses, data mining, and the importance of data science in various industries. It outlines the skills required for data scientists, the significance of big data, and the ethical considerations in data handling. Additionally, it discusses the current landscape of data science, including machine learning, data visualization, and interdisciplinary collaboration.


Introduction To Data Science

5th Semester
University Of Science & Technology Bannu

Prepared By:
CodeZdaka Team (USTB)

These notes are created by the CodeZdaka-Team (USTB).


While we try our best, some errors might be present.
We appreciate your understanding and welcome any corrections or suggestions.
Feel free to reach us at codezdaka@[Link]

Compiled by…
Date: Nov-2025

©2025 CodeZdaka Team. All rights reserved.


CHAPTER 1
Data:
• Data is a collection of raw facts and figures.
• The word "raw" means that the facts have not yet been processed to give them meaning.

Types:
Data can be divided into three categories:

Category | Type | Very Short Meaning
Main Types | Structured | Table-form data
Main Types | Unstructured | No fixed shape (like images or text)
Main Types | Semi-Structured | Has some structure, like tags or JSON
Types by Value | Numerical (Discrete / Continuous) | Number data: counts or measurements
Types by Value | Categorical (Nominal / Ordinal) | Group names: unordered or ordered
Types by Value | Time-Series | Data changing over time
Types by Value | Spatial | Data with location or coordinates
Types by Representation | Numeric | Only numbers
Types by Representation | Alphabetic | Only letters
Types by Representation | Alphanumeric | Letters and numbers
Types by Representation | Special Characters | Symbols like @, #, $
Types by Representation | Text | Normal written sentences
Types by Representation | Image | Pictures or photos
Types by Representation | Audio | Sounds or voice
Types by Representation | Video | Moving pictures with sound
Types by Representation | Binary | Stored as 0s and 1s

Database:
A database is an organized collection of related data that can be easily stored, managed, and retrieved when needed.
OR
A database is a collection of logically interrelated data that can be used or shared by many users according to their needs.
A database is managed using software called a DBMS (Database Management System).
There are two main kinds of databases:
 Relational Database – stores data in connected tables (e.g., MySQL, PostgreSQL).
 Non-Relational (NoSQL) Database – stores data in documents or key-value form (e.g., MongoDB).
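
To make the idea of "connected tables" concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is chosen only because it needs no server (the notes name MySQL and PostgreSQL); the table names, columns, and rows are invented for illustration.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two connected tables: each order points to a customer through customer_id.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "
    "amount REAL)"
)

# Store some made-up data.
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ali"), (2, "Sara")])
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 250.0), (2, 1, 100.0), (3, 2, 900.0)],
)
conn.commit()

# Retrieve related data by joining the two tables.
rows = cur.execute(
    "SELECT c.name, SUM(o.amount) "
    "FROM customers c JOIN orders o ON o.customer_id = c.id "
    "GROUP BY c.id ORDER BY c.id"
).fetchall()
print(rows)
conn.close()
```

The join is what makes the database "relational": the total per customer is computed from two tables that are linked by a shared id.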

Data Warehouse:
A data warehouse is a large central system that stores historical data from many different sources for analysis,
reporting, and decision-making.
Unlike a normal database (used for daily work), a data warehouse is used to study data, find patterns, and make
smart decisions.

Main features of a data warehouse:

 Stores historical data (many years old).


 Combines data from different sources.
 Used for reporting and analysis, not for daily updates.
 Usually read-only — we do not edit data here.
 Helps in business intelligence (BI) and data mining.

Difference between Database and Data Warehouse

Feature | Database | Data Warehouse
Purpose | Used for daily operations and transactions | Used for data analysis, reporting, and decision-making
Data Type | Stores current (real-time) data | Stores historical (past) data
Data Source | Usually one main source | Combines data from many sources
Data Usage | For adding, updating, and deleting data | For analyzing, summarizing, and reporting data
Users | Used by operational staff and developers | Used by analysts and managers
Data Structure | Highly normalized (many small linked tables) | Denormalized (combined tables for fast analysis)
Query Type | Simple, frequent queries | Complex, large analytical queries
Update Frequency | Updated frequently in real-time | Updated periodically (daily, weekly, etc.)
Storage Volume | Smaller amount of data | Very large amount of data
Examples | MySQL, Oracle, PostgreSQL | Amazon Redshift, Google BigQuery, Snowflake


Data Mining:
Data mining is the process of finding hidden patterns, trends, and useful information from large sets of data.
Data mining is an important part of data science and business intelligence. It helps companies make better
decisions, predict trends, and understand customer behavior.

Some common techniques of data mining are:

 Classification – sorting data into groups (e.g., spam or not spam).


 Clustering – grouping similar data (e.g., grouping customers by habits).
 Association – finding relations (e.g., “people who buy bread also buy butter”).
 Regression – predicting future values (e.g., sales forecast).
 Anomaly Detection – finding unusual patterns (e.g., fraud detection).

Data science:
Data science is the interdisciplinary field that uses scientific methods, mathematics, and computer tools to collect,
clean, analyze, and understand data to make better decisions.
Interdisciplinary means data science uses ideas from many subjects together.

Main Subjects Used in Data Science:

1. Mathematics

 Calculus – study of change, helps in optimization.


 Linear Algebra – vectors and matrices, used in ML.
 Probability – deals with uncertainty.

2. Statistics

 Descriptive Statistics – summarizing data (mean, median, mode, variance, charts).
 Inferential Statistics – making conclusions and predictions (hypothesis testing, confidence intervals, correlation, regression).

3. Computer Science / Programming

 Python
 Algorithms
 Data structures
 Libraries (NumPy, pandas, scikit-learn, Matplotlib, seaborn)

4. Domain Knowledge

 Understanding the specific field (business, finance, health, marketing, sports, etc.)

Data science is all about turning raw data into useful knowledge.

A data scientist collects the data, cleans it (removes errors), analyzes it, and explains what it means in a way that
helps others make decisions.
Data Science Hype:
Data Science Hype means the period when data science became extremely popular, and people started thinking it
could solve almost every problem.
Companies, media, and universities promoted data science as a magical field, creating very high expectations.

Processes involved in Data Science:


• Data science involves a variety of processes, including data collection, data cleaning and preparation, data
analysis, and communication of results.
• Data is collected from various sources, such as databases, APIs, or sensors, and then cleaned and
transformed to ensure that it is suitable for analysis.
• Data analysis involves the use of statistical and machine learning techniques to extract insights from the
data, which are then communicated to stakeholders through various means, such as visualizations or reports.

Applications of Data Science and Skills required for Data Science:


• Data science has applications in a wide range of industries, including finance, healthcare, marketing, and
transportation.
• It is used for tasks such as fraud detection, predictive maintenance, customer segmentation, and
personalized recommendations.
• To be a successful data scientist, one must have a strong foundation in mathematics and statistics, as well as
expertise in programming languages such as Python or R.
• Strong communication skills are also important to effectively communicate insights to stakeholders.
Why Data Science?
1. Data has become very important for industries. It is like the new electricity. Companies need data to grow,
improve, and make better decisions.

2. Data Scientists work with this data to help companies understand it and use it to make proper decisions.

3. Companies use a data-driven approach, where large amounts of data are analyzed to find useful insights.
These insights help companies know how they are performing in the market and how to improve.

4. Data Science is not only used in business. Healthcare industries also use it to detect problems like tumors
or deformities at an early stage, which helps in faster diagnosis and better treatment.

Big Data:
Big Data means extremely large datasets that are too big, too fast, or too complex for traditional data processing
tools to handle. It is a core part of data science because modern systems generate massive amounts of data every
second.

Characteristics of Big Data (The 5 V’s):


• Volume: Data size is very large (in TB, PB, or more). Example: Facebook stores billions of images,
YouTube stores millions of videos daily.

• Velocity: Data is generated very fast and must be processed quickly. Example: Stock market updates every
millisecond, live sensor data from cars and IoT devices.

• Variety: Data comes in many forms – structured (tables, numbers), semi-structured (JSON, logs), and
unstructured (images, videos, text).

• Veracity: Data may contain errors or noise. Big Data systems must clean and filter it.

• Value: The goal is to extract useful insights. Example: Amazon recommends products, hospitals predict
disease risks.

Technologies and Techniques for Big Data:


Big Data needs special tools and techniques because normal methods cannot handle its size, speed, or complexity.

• Big data technologies, such as Hadoop and Spark, provide the tools and techniques needed to store, process,
and analyze large and complex data sets.
• Machine learning algorithms and other data science techniques are also used to extract insights from big
data.
• The insights obtained from big data can help businesses and organizations make informed decisions,
optimize their operations, and improve their products and services.

The tools and technologies can be:

 Storage & Processing Tools.


 Cloud Platforms
 Data Integration Tools
 Data Streaming Tools

Datafication:
• Datafication means turning different parts of the world—people, places, things, and activities—into digital
data.
• This process includes collecting, storing, and analyzing large amounts of data so it can be used to gain
insights, make decisions, and create value.
• The rise of digital technologies and more connectivity between people, devices, and systems has made
datafication grow rapidly.

Examples:

 IoT devices and sensors collect huge amounts of data about the physical world.
 Social media and e-commerce platforms collect data about consumer behavior and preferences.

Datafication is important because it provides the raw data that can be analyzed to make smarter decisions.

Current Landscape of Data Science:


This means the big picture of Data Science today—all the tools, trends, applications, and challenges.

Perspectives of Data Science:


These are the different angles or points of view from which we can look at Data Science, like business, ethical,
technological, societal, and career perspectives.

Current landscape of perspectives in Data Science means the overall modern view of Data Science, seen from
different important angles.

 Machine Learning and Artificial Intelligence:


Machine learning (ML) and artificial intelligence (AI) are very important perspectives in Data Science.
Researchers are constantly exploring new algorithms and techniques, like deep learning and reinforcement
learning, to make predictive models more accurate and efficient.
ML and AI are applied in many areas, including image recognition, speech recognition, natural language
processing (NLP), and predictive modeling. These technologies allow computers to learn from data and
make decisions without being explicitly programmed for every task.

 Ethical and Privacy Perspective:


Ethics and privacy are critical concerns in Data Science. As data becomes more common in our daily
lives, worries about how it is collected, stored, and used are increasing.
Data scientists must make sure that the data they use is collected ethically and analyzed responsibly. They
must also ensure that their models do not cause bias, discrimination, or unfair results, protecting people's
privacy and rights.

 Big Data and Data Engineering:


Big Data is another major perspective in Data Science. The volume, variety, and complexity of data are
growing fast due to businesses, organizations, and individuals generating huge amounts of information.
Data engineers develop new tools and technologies to store, process, and manage large datasets
efficiently. This ensures that data can be used effectively for analysis and decision-making.

 Data Visualization:
Data visualization is an essential perspective because it allows analysts and stakeholders to explore and
understand data easily. Tools like Tableau and Power BI help present data in a clear, visual format.
New techniques, such as augmented reality (AR) and virtual reality (VR), are being developed to make
data exploration more interactive and engaging, helping people see patterns and trends more clearly.

 Interdisciplinary Collaboration:
Data Science is not just about coding or statistics; it requires collaboration between people from different
backgrounds. This includes statisticians, computer scientists, domain experts, and business analysts.
Working together ensures that data is analyzed correctly, interpreted accurately, and the insights derived are
applied effectively to solve real problems.

 Data Science for Social Good:


Data scientists are increasingly using their skills to address social and environmental issues. Examples
include climate change, public health, and poverty reduction.
This perspective shows how Data Science can positively impact society, not just businesses, by helping
make informed decisions for the betterment of communities and the environment.

Skills Needed for Data Science:


 Programming:
Data scientists must be skilled in at least one programming language, such as Python, R, or Java.
They should also know how to work with databases, handle data structures, and use algorithms to solve
problems efficiently.
Programming allows them to process large datasets, automate tasks, and implement machine learning
models.

 Statistical Analysis:
A strong foundation in statistics is very important.
This includes knowledge of probability, regression analysis, hypothesis testing, and experimental
design.
Statistics helps data scientists understand patterns in data, make predictions, and validate their
findings.

 Machine Learning:
Machine learning is a critical skill because it enables data scientists to build predictive models and
intelligent algorithms.
They should know different types of machine learning algorithms, including supervised learning (where
data has labels) and unsupervised learning (where data has no labels).
Machine learning helps in solving real-world problems like predicting customer behavior, detecting
fraud, or recognizing images and speech.

 Data Visualization:
Visualizing data is essential for communicating insights clearly to stakeholders.
Data scientists should know how to use tools like Tableau, Power BI, or Matplotlib to create charts,
graphs, and dashboards.
Visualization helps make complex data easier to understand and supports data-driven decision making.

 Business Acumen (Understanding):


Understanding the business and industry is crucial.
Data scientists must be able to identify business problems, develop hypotheses, and translate data
insights into actionable recommendations.
Business knowledge ensures that data analysis aligns with real-world goals and strategies.

 Communication Skills:
Data scientists need to explain complex technical concepts in simple terms to non-technical stakeholders.
Excellent written and verbal communication is necessary to present insights clearly and effectively.
Communication helps ensure that data insights are understood and used in decision-making.

 Creativity:
Creativity is important because data scientists often need to find innovative solutions to complex
problems.
They should be able to look at problems from different angles and think outside the box.
Creative thinking helps in discovering hidden patterns in data and developing new approaches for
analysis.

CHAPTER 2 (EDA)

Types of Data In Data Science:


Data in statistics or data science can be of different types and measured on different scales. These help us
know how to handle, analyze, and visualize the data.
We can measure it directly or collect it through experiments, surveys, or observations.
Types of Data:
1) Quantitative (Numerical) Data:
Quantitative data is numerical information that we can measure or count. It helps us perform calculations
and statistical analysis. It shows the quantity of something.

There are two main types:

 Discrete Data:
Think of discrete data as things we can count one by one. They are separate numbers, and we cannot
split them into smaller parts. Counts are always in integer form.

o Examples:
 Number of students in a class (we cannot say 10.5 students)
 Number of cars in a parking lot
 Goals scored in a football match
o How we show it: Bar charts, histograms, tables with counts
 Continuous Data:
Continuous data is like measuring something, not counting it. It can take any value, even fractions or
decimals, and can be very precise.

o Examples:
 Temperature (36.6°C)
 Weight (70.5 kg)
 Time to finish a race
o How we show it: Line plots, scatter plots, box plots

Continuous data is further divided into:

 Interval Data
o Numbers have equal steps between them, but zero does not mean nothing.
o Example: Temperature in Celsius, calendar dates, time of day
o We can add or subtract, but cannot multiply or divide meaningfully
 Ratio Data
o Numbers have a true zero, which means nothing exists at zero.
o Example: Weight, height, distance, duration (elapsed time)
o We can add, subtract, multiply, divide
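
A short worked example may help show why multiplying or dividing is not meaningful on an interval scale. The sketch below converts Celsius to Kelvin (Kelvin = Celsius + 273.15, a scale with a true zero, which is not mentioned in the notes and is used here only for comparison). All values are made up.

```python
# Interval scale: Celsius has an arbitrary zero, so ratios are misleading.
c_low, c_high = 10.0, 20.0
naive_ratio = c_high / c_low  # 2.0, but 20 °C is not "twice as hot" as 10 °C

# Kelvin has a true zero (ratio scale), so this ratio is meaningful.
k_low, k_high = c_low + 273.15, c_high + 273.15
true_ratio = k_high / k_low

# Differences are meaningful on both scales and agree with each other.
diff_c = c_high - c_low
diff_k = k_high - k_low

# Ratio scale: weight has a true zero, so "twice as heavy" makes sense.
w_a, w_b = 35.0, 70.0
weight_ratio = w_b / w_a

print(naive_ratio, round(true_ratio, 3), diff_c, round(diff_k, 2), weight_ratio)
```

The Celsius ratio says "2 times", while the same two temperatures in Kelvin differ by only about 3.5 percent. The weight ratio of 2 is a real "twice as heavy".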

2) Qualitative (Categorical) Data:

Qualitative data describes qualities or categories, not numbers. We cannot do arithmetic on it, but we can
count or group it.

 Nominal
o These are categories without any order.
o We cannot say one category is higher or lower than another.
o Examples: Colors (red, blue, green), Gender (male, female), Blood type (A, B, O, AB).
o We use it to count how many items are in each category, and we show it with pie charts and bar graphs.
 Ordinal
o These are categories with a meaningful order, but the difference between them is not exact.
o Example: Customer satisfaction (poor, average, good, excellent), Education level (high school,
bachelor, master, PhD).
o We know the order matters, but we cannot measure how much bigger or better one category is
compared to another.

[Figure: Types of Data. Quantitative Data splits into Discrete Data and Continuous Data (Interval Scale, Ratio Scale); Qualitative (non-numerical) Data splits into Nominal and Ordinal.]

Exploratory Data Analysis (EDA):
Exploratory Data Analysis (EDA) is the process where we study and explore data to understand patterns,
mistakes, missing values, and relationships before building any model or doing deeper analysis.

 We do EDA to make sure the data is clean, correct, and ready for machine learning.
 We don’t start with complex math in EDA.
 We just look, explore, and understand what is going on inside the data.

Basic tools and techniques:

Basic tools and techniques in EDA are the simple methods we use to look at data, clean data, and
understand data before doing advanced analysis. Some of the most useful tools and techniques are the following:

Descriptive statistics.

• This gives us a quick summary of a column.


• We see the mean, median, mode, minimum, maximum, range, and percentiles.
• These numbers help us understand how the data is spread and where most values lie.
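
The summary numbers listed above can be computed with Python's built-in statistics module. This is only a small sketch; the column of marks is made up, and the "inclusive" quartile method is one of several common conventions.

```python
import statistics

# Hypothetical column of exam marks (made-up values).
marks = [45, 50, 55, 55, 60, 65, 70, 90]

mean = statistics.mean(marks)        # average value
median = statistics.median(marks)    # middle value
mode = statistics.mode(marks)        # most frequent value
minimum, maximum = min(marks), max(marks)
value_range = maximum - minimum

# Quartiles = 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(marks, n=4, method="inclusive")

print(mean, median, mode, minimum, maximum, value_range)
print(q1, q2, q3)
```

Here the mean (61.25) is higher than the median (57.5) because the single high mark of 90 pulls the average up.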

Data visualization.

• Pictures help us understand the data much faster.


• We use graphs to see patterns, trends, and outliers.
• Common graphs are histograms, bar charts, scatter plots, box plots, and heatmaps.

Histogram:
A histogram is a graph that shows how the values of a numeric column are spread.

• The values are divided into small groups called bins.


• Then we see how many values fall in each bin.
• This helps us see the shape of the data.
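
The binning idea can be sketched without any plotting library. The ages and the bin width of 10 below are assumptions made for the example; a real analysis would normally draw the chart with a plotting tool.

```python
from collections import Counter

# Hypothetical ages (made-up values).
ages = [18, 19, 21, 22, 22, 25, 27, 31, 33, 38, 41, 47]
bin_width = 10  # assumed bin size

# Map each value to the lower edge of its bin, e.g. 27 -> 20.
counts = Counter((age // bin_width) * bin_width for age in ages)

# Print a simple text histogram: one '#' per value in the bin.
for start in sorted(counts):
    label = f"{start}-{start + bin_width - 1}"
    print(f"{label:>6} | {'#' * counts[start]}")
```

The longest row (20-29) shows where most values fall, which is the "shape" a histogram is meant to reveal.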

Box plot:

• A box plot shows the middle value (median), the quartiles, and the outliers.
• It is great for checking if some values look too big or too small.
• It also shows how wide or narrow the data is.

Scatter plot:

• A scatter plot shows the relationship between two numeric columns.
• Each point shows a pair of values.
• We can see if the two variables move together or not.
• We can see patterns, clusters, or trends.

Bar chart:
A bar chart is used for categorical data.

• Each category has a bar.


• The height of the bar tells us how many times that category appears.
• It is a simple way to compare groups.

Heatmap:
A heatmap shows values using colors.

• It is often used for correlation.
• The color intensity shows how strong or weak each relationship is.
• It helps a lot when the dataset is large.

Cross-tabulation:
Cross-tabulation is used for two categorical columns.

• We make a table that shows how often each pair of categories appears.
• This helps us see the relationship between two categorical variables.
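
A cross-tabulation is just a count of category pairs. The sketch below builds one with the standard library; the survey records and column meanings are invented for illustration.

```python
from collections import Counter

# Hypothetical survey records: (gender, preferred_device). Made-up data.
records = [
    ("male", "mobile"), ("female", "laptop"), ("female", "mobile"),
    ("male", "mobile"), ("female", "laptop"), ("male", "laptop"),
]

# Count how often each pair of categories appears.
pair_counts = Counter(records)

rows = sorted({r for r, _ in records})
cols = sorted({c for _, c in records})

# Print the table: one row per gender, one column per device.
print(f"{'':8}" + "".join(f"{c:>8}" for c in cols))
for r in rows:
    print(f"{r:8}" + "".join(f"{pair_counts[(r, c)]:>8}" for c in cols))
```

Each cell answers one question, for example "how many females prefer a laptop?".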

Outlier detection:

• It is the process of finding values that are far away from the rest.
• These values can affect our analysis.
• We can find outliers using z-score, Tukey’s method, or simply by looking at a box plot.
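
Both the z-score and Tukey's method can be sketched in a few lines. The sales figures are made up, and the z-score cut-off of 2.5 is an assumption: in a very small sample a single extreme value cannot reach a z-score of 3, so a lower cut-off is used here.

```python
import statistics

# Hypothetical daily sales with one suspicious value (made-up data).
sales = [52, 48, 50, 47, 53, 49, 51, 50, 120]

# z-score: how many standard deviations a value is from the mean.
mean = statistics.mean(sales)
std = statistics.pstdev(sales)
z_outliers = [x for x in sales if abs(x - mean) / std > 2.5]  # assumed cut-off

# Tukey's method: flag values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(sales, n=4, method="inclusive")
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
tukey_outliers = [x for x in sales if x < low or x > high]

print(z_outliers, tukey_outliers)
```

Tukey's limits are the same limits a box plot uses for its whiskers, which is why outliers can also be spotted by looking at a box plot.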

Data transformation.

• Sometimes data is not in a good shape.


• Maybe it is too skewed or not normally distributed.
• We can fix this using log transform, square root transform, or scaling.
• This makes the data easier to work with.
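
The three transformations named above can be sketched as follows. The incomes are made-up values, and min-max scaling is used as one common form of scaling (other forms exist).

```python
import math

# Hypothetical right-skewed incomes (made-up values).
incomes = [20_000, 25_000, 30_000, 40_000, 1_000_000]

# Log transform compresses large values and reduces right skew.
log_incomes = [math.log10(x) for x in incomes]

# Square root transform is a milder version of the same idea.
sqrt_incomes = [math.sqrt(x) for x in incomes]

# Min-max scaling maps values into the range [0, 1].
lo, hi = min(incomes), max(incomes)
scaled = [(x - lo) / (hi - lo) for x in incomes]

print([round(v, 2) for v in log_incomes])
print([round(v, 3) for v in scaled])
```

Before the log transform the largest income is 50 times the smallest; after it, the largest value is less than 1.5 times the smallest, so one extreme value no longer dominates.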

Philosophy of EDA:
Philosophy of EDA means the basic way of thinking while exploring data.
It says that we first understand the data through simple checks and visuals before doing any big model.
 When we do EDA, we try to “see” the data clearly.
 We use simple plots.
 We stay curious.
 We move step by step.
 We let the data guide us.
The core principles of EDA:

Uncovering insights with visuals


• We use pictures to understand the data.
• Things like histograms, scatter plots, box plots help us see patterns.
• We can see outliers, trends, relationships.
• Raw numbers cannot show these things easily, but visuals can.
Explore before modelling:
• We do not start with machine learning or heavy math.
• We first explore.
• We learn the shape of the data.
• We learn how values are spread.
• Then we decide cleaning steps, feature steps, and model steps.
This makes our work better.

Open mind and curiosity:

• We do not assume anything.


• We stay open.
• We ask small questions.
• We check if the data is saying something new.
• Sometimes we find surprises that we never expected.

Iterative process:
• EDA is not done only once.
• We explore, then we think, then we explore more.
• New questions come.
• We make new plots.
• Slowly, we understand the data deeper.

Holistic understanding:

• We also think about the real-world meaning of the data.


• We think about how the data was collected.
• We think about limits or mistakes in the data.
• We try to see the whole picture, not just numbers.

Communication and storytelling:


• EDA is also about talking clearly.
• We use simple charts and simple words to explain insights.
• We tell a small story: what we found and why it matters.

Data quality and integrity:

• We check if the data is clean.


• We search for errors.
• We search for missing values.
• We search for strange points.
• Good data gives good results.
• Bad data gives bad results.
CHAPTER 3 (MACHINE LEARNING)
What is learning?
Learning is the process of acquiring new understanding, knowledge, skills, behaviors, values, attitudes, or
preferences through study, experience, or teaching.

Learning is how humans, animals, and some machines gain knowledge or skills. It happens when we study,
practice, observe, or are taught.

 It changes the way we think, behave, or perform tasks.


 It involves acquiring and processing information.
 Through learning, we develop abilities, attitudes, and understanding.

What is Machine learning (ML)?


Machine Learning (ML) is a subfield of artificial intelligence that enables machines to learn from data, improve
their performance on tasks, and adapt to new situations without being explicitly programmed.

Machine learning means teaching computers to learn from experience.

 According to Herbert Simon, learning is any process by which a system improves performance from
experience.
 According to Tom Mitchell (1998), ML studies algorithms that improve performance (P) at a task (T)
with experience (E). A learning task can be defined as <P, T, E>.
 Instead of fixed rules, ML algorithms analyze data, find patterns, and adapt to changes.
 ML is used in many areas like recommendation systems, self-driving cars, voice assistants, and spam
detection.

ML is making machines smarter by learning from data and experience, not just following instructions.

ML is used when:
 Human expertise does not exist (e.g., navigation on Mars)
 Humans cannot explain their expertise (e.g., speech recognition)
 Models must be customized for specific tasks
 Models are based on huge amounts of data

We use ML when problems are too complex, data is large, or human knowledge is limited.

Traditional Programming Vs Machine Learning:


Traditional programming uses rules and logic written by humans to produce output from input, while Machine
Learning uses data and experience to automatically learn patterns and generate output.

The main difference is how the output is produced:


 Traditional Programming:
o We provide input + rules → output.
o Humans write the logic and rules.
o Works well for well-defined problems.
o Example: A calculator adds numbers because we wrote the rules.

 Machine Learning:
o We provide data + output → algorithm learns rules → predicts new output.
o Machine finds patterns and improves from experience.
o Works well for complex problems where rules are hard to write.
o Example: Email spam filter learns from examples of spam emails.
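
The difference can be sketched with a toy spam example. The feature (number of links in an email), the hand-written threshold, and the training examples are all assumptions made for illustration; real spam filters are far more complex.

```python
# Traditional programming: a human writes the rule.
def is_spam_rule(num_links):
    return num_links > 5  # threshold chosen by hand (assumed value)

# Machine learning (toy version): the rule is learned from labelled examples.
# Each example is (number of links in the email, is_spam). Made-up data.
examples = [(0, False), (1, False), (2, False), (3, True), (4, True), (7, True)]

def learn_threshold(data):
    """Pick the threshold that classifies the most training examples correctly."""
    best_t, best_correct = None, -1
    for t in sorted({x for x, _ in data}):
        correct = sum((x >= t) == label for x, label in data)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

threshold = learn_threshold(examples)

def is_spam_learned(num_links):
    return num_links >= threshold

print(threshold, is_spam_rule(5), is_spam_learned(5))
```

In the first function the rule comes from the programmer. In the second, the same kind of rule comes from the data, so it changes automatically if the examples change.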

Commercial Motivation for Machine Learning

 We see machine learning used a lot in business.
The main reason is money and growth.
Companies want to earn more and spend less.
Machine learning helps in this.

 We can use machine learning to understand customers.
We see what people like.
We see what they buy.
We see when they buy.

Using this, companies can:

 show ads to the right people
 suggest products people may like
 keep customers for a long time

 Machine learning also helps to save cost.
We can automate work.
This means computers do tasks instead of humans.
For example, chatbots reply to customers.
Fraud systems stop fake transactions.

 We also use machine learning to make better decisions.
We can predict future sales.
We can predict demand.
We can manage stock better.

So, commercial motivation means using machine learning to:

 increase profit
 reduce cost
 improve business decisions

Scientific Motivation for Machine Learning:

 Now we talk about science.
Here the goal is not money.
The goal is knowledge and understanding.

 Scientists want to understand complex data.
Some data is too big and too hard for humans.
Machine learning helps here.

We use machine learning to:

 find patterns in data
 discover new facts
 test scientific ideas

 In medicine, we study diseases.
We analyze patient data.
We try to find early signs of illness.

 In biology, we study genes and cells.
Machine learning helps to find relations between them.

 In physics and astronomy, we study space data.
There is a huge amount of data.
Machine learning helps to analyze it fast.

So, scientific motivation means using machine learning to:

 learn new things
 understand nature
 solve hard scientific problems

Types of Machine Learning

 Supervised learning (inductive learning):

This type uses training data.


Training data includes desired outputs.
Desired output means the correct answer is already given.

 Unsupervised learning:

This type also uses training data.


Training data does not include desired outputs.
There is no correct answer given.

 Reinforcement learning:

This type learns from actions.


Actions happen in a sequence.
The system gets rewards from these actions.

Supervised & Unsupervised Learning:

Supervised Learning

 Uses a labeled dataset.
 Finds relationship between input and output.
 Can generate output for new data points.
 Models are reliable but expensive and limited.

Examples:

 Classification: Associative classifiers, Decision Trees, Instance Learning, Bayesian Learning, Kernel Machines, Neural Networks, Genetic Algorithms, etc.
 Regression: Linear Regression, …

Unsupervised Learning

 Uses an unlabeled dataset.
 Finds structure in the data.
 Output attributes are not defined.

Examples:

 Clustering: K-means, DBSCAN, Hierarchical algorithms, Self-Organizing Maps, etc.
 Associations: Apriori, FP-Growth, …

Reinforcement Learning

 Focuses on maximizing rewards from results.


 Also called credit assignment learning.
 Involves additional decisions about rewards.
 Balances the trade-off between exploring new actions and exploiting what is already known.

Supervised Learning:
Regression:
Regression is used when the output is a number. This number can change smoothly. We call it a
continuous value.
Regression uses other variables to predict one value.

Regression models can be:

 linear, meaning a straight line
 nonlinear, meaning a curve

Regression is very common. It is used in statistics. It is also used in neural networks.

Examples:

• Predict sales of a new product using advertising data.


• Predict wind velocity using temperature, humidity, and air pressure.
• Predict stock market trends over time. This type of data is called time series.
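
A minimal sketch of linear regression for the first example, fitting a straight line by ordinary least squares. The advertising and sales numbers are made up, and only one input variable is used to keep the arithmetic visible.

```python
# Hypothetical data: advertising spend (x) and sales (y). Made-up values.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

n = len(ad_spend)
mean_x = sum(ad_spend) / n
mean_y = sum(sales) / n

# Least-squares slope and intercept for the line y = slope * x + intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, sales))
sxx = sum((x - mean_x) ** 2 for x in ad_spend)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

def predict(x):
    """Predict a continuous value (sales) for a new advertising spend."""
    return slope * x + intercept

print(round(slope, 2), round(intercept, 2), round(predict(6.0), 2))
```

The output is a number that can take any value on the line, which is what makes this a regression problem and not a classification problem.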

Classification:
Classification is used when the output is a class. A class is a group or label. Each record belongs to
one class only.
Classification helps us group data. It helps us find behavior patterns.

Example:

We look at:

 total money spent


 total items purchased

If a customer spends more than 800 dollars in one purchase,


we classify the customer as a good customer.
Otherwise, the customer goes to another class.

How Classification Works:

First, we collect data. This data is called the dataset.


The dataset is divided into two parts:

 training set
 test set
Training set:

The training set is used to build the model. Each record has attributes. One attribute is the
class label.
The model learns how attributes relate to the class.

Test set:

The test set is used to check the model on records it did not see during training.
Accuracy is the fraction of test predictions that are correct.
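
The train/test workflow can be sketched with a very simple classifier. A 1-nearest-neighbour rule (a form of instance learning) is used here only as an example; it is not necessarily the method meant by the notes. The customer records and the fixed 6/4 split are assumptions.

```python
import math

# Hypothetical records: (total money spent, total items purchased, class label).
dataset = [
    (950, 12, "good"), (120, 2, "other"), (870, 9, "good"), (300, 4, "other"),
    (1100, 15, "good"), (200, 3, "other"), (820, 10, "good"), (450, 5, "other"),
    (900, 11, "good"), (150, 1, "other"),
]

# Divide the dataset: first 6 records for training, last 4 for testing.
# (A fixed split keeps the example simple; real projects usually shuffle first.)
train_set, test_set = dataset[:6], dataset[6:]

def predict(spent, items):
    """1-nearest-neighbour: copy the label of the closest training record."""
    nearest = min(train_set, key=lambda r: math.dist((spent, items), (r[0], r[1])))
    return nearest[2]

# Accuracy = correct predictions on the test set / size of the test set.
correct = sum(predict(s, i) == label for s, i, label in test_set)
accuracy = correct / len(test_set)

print(accuracy)
```

The model only ever looks at the training set; the test set is kept aside so that the accuracy reflects how well it handles new records.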

Unsupervised Learning:
Clustering:

Clustering means making groups. We give the model many data points. Each point has some
features.

The model checks similarity. Similar points go into the same group. Different points go into
different groups.

Inside one group, points are close. Between groups, points are far. That is the main idea.
Simple and useful.

To measure closeness, we often use distance. For numeric data, we use Euclidean distance.
That means the ordinary straight-line distance between two points.

Algorithms used in clustering:

We commonly use:

 K-Means
 Hierarchical Clustering
 DBSCAN

These algorithms do the grouping work.
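
As a rough sketch of how K-Means groups points with Euclidean distance, the code below repeats two steps: assign each point to the nearest centre, then move each centre to the mean of its points. The data points and the hand-picked starting centres are assumptions; library implementations usually choose the starting centres randomly.

```python
import math

# Hypothetical 2-D points forming two obvious groups (made-up data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

def k_means(data, centroids, iterations=10):
    """Plain K-Means: assign points to the nearest centroid, then update centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in data:
            # Euclidean distance decides which centroid is closest.
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the points in the cluster (keep the old one if empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Starting centroids are chosen by hand here to keep the run repeatable.
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

Points inside each resulting cluster are close to each other, and the two clusters are far apart, which is the main idea described above.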

Where clustering is used:

A very common use is documents. We give many documents to the model. The model looks
at important words. Documents with similar words go together.

Search systems use this. When we search, similar documents appear. Google Scholar is one
example.
Association:
This also comes under unsupervised learning. Here we do not make groups. Instead, we find
relationships.

We check which things appear together. Mostly used in shopping data.

For example, customers buy items. Some items are often bought together. This is very useful for
stores.

A famous rule is:


If diaper and milk are bought,
then eggs are also likely to be bought.

Stores place these items close. Sales increase. Smart use of data.

Algorithms used in association rules:

We commonly use:

 Apriori
 FP-Growth

These algorithms find item relationships.
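
As a rough sketch of what such algorithms measure, the code below computes support and confidence for one rule. Support and confidence are the standard measures used with association rules; they are not defined in these notes, and the shopping baskets are hypothetical.

```python
# Hypothetical shopping baskets, one set of items per customer visit.
transactions = [
    {"diaper", "milk", "eggs"},
    {"diaper", "milk", "eggs", "bread"},
    {"diaper", "milk"},
    {"milk", "bread"},
    {"eggs", "bread"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

# Rule: if diaper and milk are bought, then eggs are also bought.
rule_support = support({"diaper", "milk", "eggs"}, transactions)
rule_confidence = confidence({"diaper", "milk"}, {"eggs"}, transactions)
```

Here two of the five baskets contain all three items, and two of the three baskets with diaper and milk also contain eggs.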

Anomaly Detection:

Anomaly detection is about finding something unusual. The data has no labels.
So we do not know what is normal or abnormal at first.

The model looks at the data. It learns what is normal behavior. Anything very different is marked
as an anomaly. An anomaly is also called an outlier.

This is useful when rare events matter a lot.

What is an anomaly:

An anomaly is a data point that:

 looks very different from most data


 does not follow the common pattern

It happens rarely.
But it is very important.

How anomaly detection works:

We give the model normal data. The model learns the normal pattern. Then new data comes.
If it fits the pattern, it is normal. If it is far from the pattern, it is an anomaly.
Distance, density, or pattern difference is used. No class labels are needed.
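
A minimal distance-based sketch of this idea follows. It assumes the normal pattern can be summarized by a mean and a standard deviation, and it uses a cutoff of 3 standard deviations; both the data and the cutoff are assumptions for illustration, not a method prescribed by the notes.

```python
from statistics import mean, stdev

normal_data = [50, 52, 49, 51, 48, 50, 53, 47]   # hypothetical "normal" readings

def fit(values):
    """Learn the normal pattern as a mean and a sample standard deviation."""
    return mean(values), stdev(values)

def is_anomaly(x, center, spread, threshold=3.0):
    """Flag x when it lies more than `threshold` standard deviations from the mean."""
    return abs(x - center) / spread > threshold

center, spread = fit(normal_data)
flags = {x: is_anomaly(x, center, spread) for x in [51, 46, 90]}
```

No class labels are used anywhere: the model only knows what typical values look like.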
Algorithms used for anomaly detection:

Common algorithms are:

 Isolation Forest
 One-Class SVM
 Local Outlier Factor (LOF)
 Autoencoders

These algorithms find rare or strange patterns.

Reinforcement Learning:
Reinforcement learning is different. Here an agent learns by trying. No teacher gives correct answers.

The agent takes an action. Then it gets a reward or punishment. The reward may come later.
So learning takes time.

The agent slowly learns:

which action is good,
which action is bad.

This is like training a child or a pet. Try, fail, learn, repeat.

There are two main ways to learn:

 model-free
 model-based

Algorithms used in reinforcement learning:

Common ones are:

 Q-Learning
 SARSA
 Deep Q-Network (DQN)

These help the agent learn from rewards.
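
To make the reward idea concrete, here is a sketch of a single Q-Learning update step. The two-state world, the action names, and the values of the learning rate `alpha` and discount `gamma` are hypothetical choices for the example.

```python
def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one Q-Learning update to the table q, stored as q[state][action]."""
    best_next = max(q[next_state].values())      # value of the best action in the next state
    target = reward + gamma * best_next          # reward now plus discounted future value
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]

# Hypothetical world with states "A" and "B"; every value starts at 0.
q_table = {s: {"left": 0.0, "right": 0.0} for s in ("A", "B")}

first = q_update(q_table, "B", "right", reward=1.0, next_state="B")
second = q_update(q_table, "A", "right", reward=0.0, next_state="B")
```

The second update gives no immediate reward, yet the value of going right from "A" still rises, because it leads to a state that has already earned a reward. This is how a delayed reward is passed back to earlier actions.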


CHAPTER 4 (ML ALGO …)
Linear Regression:
Linear Regression is a statistical method used to find the relationship between variables using a straight line.
It shows how one variable changes when another variable changes.

Types:

1. Simple Linear Regression:

One independent variable and one dependent variable.

Model:

Y = a + bX

Where:

Y = dependent variable
X = independent variable
a = intercept
b = slope

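Slope:

\[ b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} \]

or, equivalently,

\[ b = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} \]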

Intercept:

\[ a = \bar{y} - b\bar{x} \]

\[ a = \frac{\sum y - b \sum x}{n} \]
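
The slope and intercept formulas can be checked with a short sketch. The advertising and sales numbers are hypothetical, chosen so that the points lie exactly on a line.

```python
def fit_line(xs, ys):
    """Return (a, b) for Y = a + bX using the least-squares slope and intercept formulas."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    b = numerator / denominator      # slope
    a = y_bar - b * x_bar            # intercept
    return a, b

# Hypothetical data: advertising spend (X) and sales (Y).
ads = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

a, b = fit_line(ads, sales)
predicted = a + b * 6                # predicted sales for an advertising spend of 6
```

For this data the fitted line is Y = 1 + 2X, so the prediction for X = 6 is 13.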

2. Multiple Linear Regression:

More than one independent variable.

\[ Y = a + b_1 X_1 + b_2 X_2 + \dots + b_n X_n \]
K Nearest Neighbor:
K Nearest Neighbor is a supervised machine-learning algorithm.

It is used for:

 Classification problems
 Regression problems

It works by checking the nearest data points.

KNN Steps:

1. Choose a number K
2. Find the K nearest points to the new point
3. Check their classes
4. The majority class becomes the answer

What is K?

K is the number of nearest neighbors we check. K is usually chosen to be odd so that a vote between two classes cannot end in a tie.

Example:

 If K = 3, we check 3 nearest points


 If K = 5, we check 5 nearest points

How do we measure "nearest"?

We use a distance formula. The most common one is Euclidean distance.

Formula:

If we have two points:

Point 1 = (x1, y1)
Point 2 = (x2, y2)

\[ \text{Distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2} \]

For N-dimensional:

\[ \text{Distance} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \]

Where "p" holds the features of the first point and "q" holds the features of the second point.

Important Note:
1. For classification, we use majority voting: the new point goes to the class that most of its K neighbors belong to.
2. For regression, instead of majority voting, we take the average of the neighbors' values.
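
The KNN steps, for both classification and regression, can be sketched as follows. The data points are hypothetical, and `math.dist` (Python 3.8+) is used for the Euclidean distance.

```python
from collections import Counter
from math import dist

def knn_predict(train, new_point, k=3, task="classification"):
    """train is a list of (features, target) pairs; features are numeric tuples."""
    # Sort by Euclidean distance to the new point and keep the K nearest.
    neighbors = sorted(train, key=lambda item: dist(item[0], new_point))[:k]
    targets = [target for _, target in neighbors]
    if task == "classification":
        return Counter(targets).most_common(1)[0][0]   # majority vote
    return sum(targets) / k                            # average for regression

# Hypothetical labelled points for classification.
labeled = [((1, 1), "A"), ((2, 1), "A"), ((1, 2), "A"),
           ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]
# Hypothetical one-feature points with numeric targets for regression.
priced = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((10,), 100.0)]

label = knn_predict(labeled, (2, 2), k=3)
value = knn_predict(priced, (2.2,), k=3, task="regression")
```

KNN has no training phase: all the work happens when a new point arrives and its neighbors are looked up.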
Naive Bayes:
Naive Bayes is a supervised machine-learning algorithm.

It is mostly used for:

 Text classification
 Spam detection
 Sentiment analysis

It is based on Bayes' Theorem.

\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \]

Where:

P(A | B) → Probability of A given B
P(B | A) → Probability of B given A
P(A) → Prior probability of A
P(B) → Probability of B

Posterior = Likelihood × Prior / Evidence

We usually do not calculate P(B) because it is the same for every class, so it does not change which class scores highest.


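Final formula used in Naive Bayes for many features:

\[ P(\text{Class} \mid X) \propto P(\text{Class}) \times P(x_1 \mid \text{Class}) \times P(x_2 \mid \text{Class}) \times \dots \times P(x_n \mid \text{Class}) \]

Where:

X is the set of all features [x_1, x_2, x_3, …, x_n]
Class is each possible output class, such as [Yes | No] or [Spam | Not-Spam]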
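
The formula can be illustrated with a tiny word-count classifier. This is a sketch under assumptions: the training messages are hypothetical, and add-one (Laplace) smoothing is included so that an unseen word does not make the whole product zero. Smoothing is a common addition that these notes do not cover.

```python
from collections import Counter, defaultdict

# Hypothetical training messages: (words, class label).
training = [
    (["win", "money", "now"], "spam"),
    (["win", "prize"], "spam"),
    (["meeting", "now"], "not-spam"),
    (["project", "meeting", "notes"], "not-spam"),
]

def train(rows):
    """Count how often each class and each word-per-class occurs."""
    class_counts = Counter(label for _, label in rows)
    word_counts = defaultdict(Counter)
    for words, label in rows:
        word_counts[label].update(words)
    vocabulary = {w for words, _ in rows for w in words}
    return class_counts, word_counts, vocabulary

def score(words, label, class_counts, word_counts, vocabulary):
    """P(Class) times the product of P(x_i | Class), with add-one smoothing."""
    total_words = sum(word_counts[label].values())
    result = class_counts[label] / sum(class_counts.values())   # prior P(Class)
    for w in words:
        result *= (word_counts[label][w] + 1) / (total_words + len(vocabulary))
    return result

def classify(words, model):
    """Return the class with the highest score; P(X) is never computed."""
    class_counts = model[0]
    return max(class_counts, key=lambda label: score(words, label, *model))

model = train(training)
prediction = classify(["win", "money"], model)
```

Each word contributes its own factor independently of the others, which is exactly the simplification the algorithm relies on.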

Why is it called Naive?

It assumes that all features are independent of each other within a class. That means Feature 1 does not affect Feature 2. This
assumption is not always true in real life. That is why it is called naive.

Types of Naive Bayes:

1. Gaussian Naive Bayes
Used when features are continuous
2. Multinomial Naive Bayes
Used for text data
3. Bernoulli Naive Bayes
Used for binary features
K-Means:
K = number of clusters
Means = average of points in a cluster

o K-Means is an unsupervised machine-learning algorithm used for clustering. Clustering means grouping similar data points together. There are no labels in K-Means.
o K-Means tries to divide the data into K groups such that points inside one group are similar to each other.
o Each cluster has a center called the centroid. The centroid is the mean of all points in that cluster.

Step-by-Step:

Suppose we choose K = 2.

1. Step 1: Select 2 random points as the initial centroids.

2. Step 2: Calculate the distance of each data point from both centroids. Assign each point to the nearest
centroid.

3. Step 3: Calculate the new centroid of each cluster by taking the mean of all points in that cluster.

Centroid formula:
For a cluster with n points:

\[ \text{Centroid} = \frac{\text{sum of all points}}{n} \]

For two features:

\[ \text{Centroid}_x = \frac{x_1 + x_2 + \dots + x_n}{n} \]

\[ \text{Centroid}_y = \frac{y_1 + y_2 + \dots + y_n}{n} \]

4. Step 4: Repeat steps 2 and 3 until the centroids stop changing.
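
The four steps can be sketched in plain Python. The points are hypothetical, and the starting centroids are fixed rather than random so that the run is repeatable; that choice is an assumption for the example only.

```python
from math import dist

def k_means(points, centroids, max_rounds=100):
    """Run K-Means from the given starting centroids; returns (centroids, clusters)."""
    for _ in range(max_rounds):
        # Step 2: assign every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its cluster (keep it if the cluster is empty).
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        # Step 4: stop when the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
# Step 1 normally picks random starting centroids; fixed ones are used here.
centroids, clusters = k_means(points, [(1, 1), (8, 8)])
```

With K = 2 the two obvious groups are recovered after the second round, when the centroids stop moving.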

Advantages:
 Simple
 Fast
 Easy to implement

Disadvantages:
 Need to choose K
 Sensitive to outliers
 Works best with round shaped clusters
CHAPTER 4 (FEATURE ENG …)
Feature Engineering:
Feature engineering means transforming raw data into better features so that a machine learning model can
learn from them. It makes the data more useful.

Feature engineering is important because model performance often depends more on the features than on the algorithm.
Even a simple model can work very well if the features are good.

Techniques:
Handling Missing Values:
If some values are empty:

 Replace with mean
 Replace with median
 Replace with mode
 Or use constant value

Example:

The Age column has missing values. Fill them with the average age.
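
A minimal sketch of filling missing values, assuming missing entries are stored as `None`. The ages are hypothetical.

```python
from statistics import mean, median

ages = [22, None, 30, 25, None, 43]   # None marks a missing value

def fill_missing(values, strategy=mean):
    """Replace None with a statistic computed from the known values."""
    known = [v for v in values if v is not None]
    fill_value = strategy(known)
    return [fill_value if v is None else v for v in values]

filled_with_mean = fill_missing(ages)
filled_with_median = fill_missing(ages, strategy=median)
```

The median is often preferred when the column contains outliers, because a few extreme values pull the mean but not the median.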

Encoding Categorical Data:


Machine learning works with numbers. If we have Color = Red, Blue, Green, we convert the values
into numbers.

Two common methods:

Label Encoding
Red = 0
Blue = 1
Green = 2

One Hot Encoding
Red → 1 0 0
Blue → 0 1 0
Green → 0 0 1
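
Both encodings can be written in a few lines. The category order Red, Blue, Green follows the example above; the input column is hypothetical.

```python
colors = ["Red", "Blue", "Green", "Blue"]   # hypothetical column values
categories = ["Red", "Blue", "Green"]       # fixed order used for both encodings

def label_encode(values, order):
    """Map each category to its position in `order`."""
    index = {name: i for i, name in enumerate(order)}
    return [index[v] for v in values]

def one_hot_encode(values, order):
    """Map each category to a list with a single 1 in that category's position."""
    return [[1 if v == name else 0 for name in order] for v in values]

labels = label_encode(colors, categories)
one_hot = one_hot_encode(colors, categories)
```

Label encoding suggests an order (Green > Blue > Red) that does not really exist for colors. One hot encoding avoids this at the cost of one extra column per category.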

Feature Scaling:

Some models, such as KNN, K-Means, and Logistic Regression, need scaling. If one feature has
much larger values than the others, it dominates them.

Example: Salary = 100000 and Age = 25, so Salary is very large compared to Age.
Min Max Scaling:

\[ X_{new} = \frac{X - X_{min}}{X_{max} - X_{min}} \]

The value falls between 0 and 1.

Standardization:

\[ Z = \frac{X - \text{Mean}}{\text{Standard Deviation}} \]

It gives the data a mean of 0 and a standard deviation of 1.
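
The two scaling formulas can be sketched as below. The salary values are hypothetical, and the population standard deviation (`pstdev`) is assumed for standardization.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

salaries = [20000, 40000, 60000, 100000]   # hypothetical values
scaled = min_max_scale(salaries)
z_scores = standardize(salaries)
```

After standardization, the values have mean 0 and standard deviation 1, whatever their original units were.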

Handling Outliers:
Outliers disturb model.

We can:

 Remove them
 Cap them
 Transform them

Transformation of Features:
If data is skewed, we apply:

Log transformation
Square root transformation

Example:

Income data is usually right skewed. Apply a log transformation to bring its distribution closer to normal.

Date and Time Features:


From date column we can extract: Year, Month, Day, Weekday

Example:

From 24-02-2026
Extract month = 2
Day = 24

This helps the model understand time-related patterns.
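
A short sketch of extracting date features with Python's standard `datetime` module. The day-month-year format string is an assumption based on the example date above.

```python
from datetime import datetime

def date_features(text, fmt="%d-%m-%Y"):
    """Split a date string into simple numeric features."""
    d = datetime.strptime(text, fmt)
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "weekday": d.weekday(),          # Monday = 0 ... Sunday = 6
        "is_weekend": d.weekday() >= 5,  # Saturday or Sunday
    }

features = date_features("24-02-2026")
```

24-02-2026 falls on a Tuesday, so `weekday` is 1 and `is_weekend` is False.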


Feature Generation:
Feature generation means creating new features from existing data to give more useful information to
the model. We create new columns.

It is important because raw data is not always powerful.

Better features = better model performance.

Examples:
Math Combination:

If we have Height and Weight

We can create:

\[ \text{BMI} = \frac{\text{Weight}}{\text{Height}^2} \]

BMI is the new feature.

Interaction Features:

If we have: Age and Income

We can create Age × Income. Sometimes interaction helps model.

Date Features:

From date: 24-02-2026

We generate: Year, Month, Day, Is weekend. These are new features.

Polynomial Features:

If we have: X

We generate: X², X³. These are used in linear regression to capture non-linear relationships.

Text Features:

From text, Convert into: Word count, TF-IDF, Bag of words


These are generated features.

Feature Selection:
Feature selection means choosing only the important features and removing useless or noisy ones. In feature
selection, we reduce the number of columns.
Feature selection is important because too many features cause:

 Overfitting
 slow training
 High computation
 Curse of dimensionality
Types of Feature Selection:

Filter Methods (Statistical Based):

Filter methods select features before training the model. They use statistical tests. They do
not depend on machine learning model.

They check the relation between each feature and the target.

If relation is strong → keep
If relation is weak → remove

(A) Correlation Method:

Used for numerical data. It measures linear relationship between feature and target.

\[ r = \frac{\text{Cov}(X, Y)}{\sigma_X \times \sigma_Y} \]

Where:

Cov = covariance
σ = standard deviation

Value of r is between -1 and 1

If:

r close to 1 → strong positive relation
r close to -1 → strong negative relation
r close to 0 → weak relation

If the absolute value of the correlation is very small → remove feature.
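
The correlation coefficient can be computed directly from its definition. The study-hours data is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - x_bar) ** 2 for x in xs))
    spread_y = sqrt(sum((y - y_bar) ** 2 for y in ys))
    return cov / (spread_x * spread_y)   # the 1/n factors cancel out

hours = [1, 2, 3, 4, 5]
score_up = [2, 4, 6, 8, 10]     # rises with hours
score_down = [10, 8, 6, 4, 2]   # falls with hours

r_up = pearson_r(hours, score_up)
r_down = pearson_r(hours, score_down)
```

Both features are equally informative here even though one coefficient is negative, which is why the size of r matters rather than its sign.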

(B) Chi Square Test:

Used for categorical features. It checks dependence between feature and target.

\[ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} \]

\[ \text{Expected} = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}} \]

If Chi square value is high → feature is important
If low → feature is not important

Used mostly in classification.
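
A sketch of the chi square calculation for a contingency table. The two tables of counts are hypothetical.

```python
def chi_square(table):
    """Chi square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = feature value (yes / no), columns = target class (buy / not buy).
dependent = [[30, 10], [10, 30]]
independent = [[20, 20], [20, 20]]

chi_dependent = chi_square(dependent)
chi_independent = chi_square(independent)
```

In the first table the feature clearly changes the class mix, so the statistic is large. In the second the observed counts equal the expected counts, so it is zero.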


(C) ANOVA (Analysis of Variance)

Used when Feature is numerical and Target is categorical. It compares mean values of
different groups.
If group means are very different → feature is useful

It calculates F score:

\[ F = \frac{\text{Between group variance}}{\text{Within group variance}} \]

Higher F score → more important feature

 Advantages of filter methods: fast, simple, work well for high dimensional data
 Disadvantages: ignore interaction between features, may select redundant features

Wrapper Methods (Model Based Search):


Wrapper methods use a machine-learning model to test different feature combinations. They
select features based on model performance.

(A) Forward Selection:

Step 1: Start with zero features

Step 2: Try adding each remaining feature one at a time, train the model, and check performance

Step 3: Keep the feature that improves performance most

Repeat steps 2 and 3 until there is no improvement.

(B) Backward Elimination:

Step 1: Start with all features

Step 2: Try removing each feature one at a time, train the model, and check performance

Step 3: Remove the worst feature, the one whose removal hurts performance least

Repeat steps 2 and 3 until performance decreases.

(C) Recursive Feature Elimination (RFE):

1. Training a model
2. Ranking features by importance
3. Removing the least important feature
4. Repeating the process
It continues until the required number of features remains.
That is why the name is:

Recursive → repeating again and again
Elimination → removing features

 Advantages: More accurate, considers feature interaction
 Disadvantages: Very slow, high computation cost, not good for very large datasets

Embedded Methods:
Feature selection happens during model training. Selection is built inside algorithm.

(A) Lasso Regression:

Lasso adds a penalty to the loss function. Normal Linear Regression loss:

\[ \text{Loss} = \sum (y_i - y_{predicted})^2 \]

Lasso adds an L1 penalty:

\[ \text{Loss} = \sum (y_i - y_{predicted})^2 + \lambda \sum |\beta| \]

If a coefficient becomes zero → the feature is removed. Lasso automatically makes some
weights zero.
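
The two loss formulas can be compared with a small sketch. It only evaluates the loss for given predictions and coefficients; it does not fit a Lasso model. All numbers, including λ = 0.1, are hypothetical.

```python
def squared_error(y_true, y_pred):
    """Normal linear regression loss: sum of squared differences."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred))

def lasso_loss(y_true, y_pred, coefficients, lam):
    """Squared error plus lambda times the sum of absolute coefficient values."""
    return squared_error(y_true, y_pred) + lam * sum(abs(b) for b in coefficients)

y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]
betas = [2.0, -0.5, 0.0]   # hypothetical coefficients; the third feature is already dropped

plain = squared_error(y_true, y_pred)
penalized = lasso_loss(y_true, y_pred, betas, lam=0.1)
```

A coefficient of zero adds nothing to the penalty, so the optimizer is rewarded for switching off features that contribute little to the fit.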

(B) Decision Tree:

A decision tree selects features based on information gain or the Gini index. The feature with the
highest information gain is selected first.
Important features appear at the top of the tree. Unimportant features may not appear at all.

Advantages of embedded methods: faster than wrapper methods, more accurate than filter methods, consider
feature interaction.
Disadvantages: depend on a specific model, may not generalize to other models.
CHAPTER 4 (DIMENSIONALITY RED …)
Dimensionality Reduction:
 Dimension means the number of features. If a dataset has Age, Income, Height, and Weight, then dimension = 4.
If a dataset has 100 features, dimension = 100.
 Dimensionality reduction means reducing the number of features while keeping the most important information.
 We need it when there are too many features: the model becomes slow, overfitting increases, storage increases,
visualization becomes difficult, and the curse of dimensionality appears. This happens especially in text data,
image data, and genomics.

Two Main Types:

1 Feature Selection
2 Feature Extraction

We already learned feature selection above.

Feature Extraction:

Feature extraction means transforming the original features into a new, smaller set of features that still
contains most of the important information.

The old features are replaced, and the new features are combinations of the old ones. So the dimension decreases.

In feature selection, we delete some features. In feature extraction, we combine the information of
many features into fewer new features, so less information is usually lost.

Methods of Feature Extraction:

Principal Component Analysis (PCA):

Most common feature extraction method. Used for numerical data.

PCA transforms the original features into new features called principal components. These
components:

 are linear combinations of the original features
 are uncorrelated with each other
 capture maximum variance

The main objective is to maximize variance. PCA is used when unsupervised dimensionality
reduction is needed.

Linear Discriminant Analysis (LDA):

Supervised feature extraction method. Uses class labels.

Its goal is to maximize the separation between classes and minimize the variation inside each class.
In other words, it maximizes the ratio of between class variance to within class variance. Used mainly for
classification problems.
Maximum components: Number of classes - 1.
Independent Component Analysis (ICA):

Used to separate mixed signals.

Find components that are statistically independent.

Example:

Separating different audio signals from mixed recording.

Used in signal processing and biomedical data.

t-SNE (t-Distributed Stochastic Neighbor Embedding):

Nonlinear feature extraction method. Used mainly for visualization.

Converts high dimensional data into 2D or 3D.

Preserves local structure. Not good for prediction models.

Autoencoders:

Neural network based method.

Structure:

Encoder → compress data


Decoder → reconstruct data

Middle layer gives reduced features. Can handle nonlinear relationships. Used in deep
learning.

Factor Analysis:

Similar to PCA. Assumes data is influenced by hidden factors. Used in psychology and social
sciences.
Model: X = LF + error
Where:

L = loading matrix
F = hidden factors

Kernel PCA:

Extension of PCA.

Used when data is non-linear.

Uses the kernel trick to implicitly map the data into a higher dimensional space,
then applies PCA there.
CHAPTER 5 (MINING SOCIAL-NETWORK GRAPH)
Social Network:
A social network is a structure made of individuals, groups, or organizations that are connected with each
other through relationships.
These relationships can be friendship, communication, business connection, or following.
In short, a social network is a group of connected entities.

Example:

In Facebook, users are connected as friends.
In Twitter, users are connected by following.

Social Network as a Graph:


In data mining, a social network is represented as a graph.
A graph is a mathematical structure that consists of nodes and edges.

 Nodes represent People, Users, Organizations


 Edges represent Relationships, Connections

If two people are friends, we draw an edge between their nodes.

Components:

Node: Represents an entity in the network.

Edge: Represents a relationship between two nodes.

Graph: Collection of nodes and edges.

Types:

Undirected Graph:

In this graph, the relationship is mutual. If A is connected to B, then B is also connected to A.

Example:

Facebook friendship.

Directed Graph:

In this graph, the relationship has a direction. If A follows B, it does not mean that B follows A.
So we draw an arrow from A to B.

Example: Twitter.
Weighted Social Network Graph:

Sometimes edges have weight. Weight represents strength of relationship.

For example:

Number of messages
Number of likes
Frequency of interaction

Higher weight means stronger connection.
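
One simple way to store these graph types in code is a dictionary of neighbors with weights. This layout is just one possible representation, and the users and message counts are hypothetical.

```python
def build_graph(edges, directed=False):
    """Store a graph as {node: {neighbor: weight}}."""
    graph = {}
    for source, target, weight in edges:
        graph.setdefault(source, {})[target] = weight
        graph.setdefault(target, {})          # make sure the target node exists
        if not directed:
            graph[target][source] = weight    # a mutual relationship is stored both ways
    return graph

# Undirected, weighted: weight = number of messages exchanged.
friendships = build_graph([("A", "B", 12), ("A", "C", 3), ("B", "C", 7)])
# Directed: A follows B, and C follows B.
follows = build_graph([("A", "B", 1), ("C", "B", 1)], directed=True)

strongest_friend_of_a = max(friendships["A"], key=friendships["A"].get)
```

In the directed graph, B appears in the neighbor lists of A and C, but B's own list is empty because B follows nobody.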

Why Represent Social Network as Graph?

Graph representation helps in:

Finding important users


Detecting communities
Studying information spread
Analyzing network structure

Clustering of Graph:
Clustering of a graph means dividing the graph into groups of nodes such that nodes inside the same group
are strongly connected and nodes in different groups are weakly connected.

 These groups are called communities or clusters.
 Nodes in the same cluster have more connections with each other than with nodes outside the cluster.
 Clustering helps in finding communities, detecting friend groups, identifying similar users, and
understanding network structure.

Example:

In Facebook, students of the same university may form one cluster.
In Twitter, users interested in the same topic may form one cluster.

Types of Graph Clustering Methods

1. Partition Based Clustering:

Graph is divided into a fixed number of clusters.

Example:

K way partitioning

Nodes are divided into K groups to minimize connections between groups.
2. Hierarchical Clustering:

Clusters are formed step by step.

Two approaches:

 Agglomerative method: Start with each node as a separate cluster. Merge similar clusters step by step.
 Divisive method: Start with the whole graph. Split it into smaller clusters.

3. Density Based Clustering:

Clusters are formed based on density of connections.

Nodes with many connections form a cluster. Sparse areas are separated.

4. Spectral Clustering:

Uses eigenvalues and eigenvectors of a graph matrix, such as the adjacency matrix or the Laplacian matrix.

The graph is converted into matrix form. Then mathematical techniques are used
to divide the graph into clusters.

Used when graph structure is complex.

Important Concept: Modularity

Modularity measures quality of clustering. High modularity means:

Many edges inside clusters
Few edges between clusters

Higher modularity means better clustering.

Example in Social Network:

Suppose we have 100 users.

Group of 30 users mostly talk to each other.


Another group of 40 users mostly interacts inside their group.

Clustering algorithm will detect these as two communities.


Direct Discovery of Communities in Graph:
Direct discovery of communities means finding groups of nodes in a graph, where nodes inside the group
are highly connected, without converting the graph into another form such as feature vectors.

 It works directly on the graph structure.


 We detect communities by analyzing connections between nodes.

Community:

A community is a group of nodes that have many connections inside the group and few connections
with nodes outside the group.

In a social network, a friend circle, students of the same university, or people with the same interest
form communities.

Main Idea of Direct Community Detection

Good community has:

 Strong internal connections


 Weak external connections

So algorithm tries to:

 Maximize internal edges


 Minimize external edges

Important Methods for Direct Community Discovery:

1. Girvan–Newman Method:

This method removes edges step by step. It calculates edge betweenness, which
measures how many shortest paths pass through an edge. Edges that connect two
communities usually have high betweenness. The algorithm removes the edge with the highest
betweenness, recalculates betweenness, and repeats. As such edges are removed, the graph splits into communities.

2. Louvain Method:

This method is based on modularity optimization.

Modularity measures the quality of a division of the graph. Higher modularity means better
communities.
The Louvain method repeats two phases:

First, each node starts as its own community, and nodes are moved to neighbor communities
if modularity increases.
Then each community found is merged into a single node, and the first phase is run again on the smaller graph.

This method is fast and works for large networks.
3. Clique Based Method:

A clique is a complete subgraph. In a clique, every node is connected to every other node.

Communities can be formed by combining overlapping cliques. This method finds very strongly connected groups.

Modularity:

Modularity measures the difference between the actual number of edges inside communities and the number
expected in a random graph with the same node degrees.
High modularity means strong community structure. The value of modularity lies between -1/2 and 1.
Higher value means better clustering.
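
As a sketch, modularity for an undirected, unweighted graph can be computed by summing, over communities, the fraction of edges inside the community minus the square of the community's share of the total degree. This is the commonly used Newman definition, which these notes describe only in words; the small network below is hypothetical.

```python
def modularity(edges, communities):
    """Modularity of an undirected, unweighted graph for a given split into communities."""
    m = len(edges)
    community_of = {node: i for i, group in enumerate(communities) for node in group}
    inside = [0] * len(communities)       # edges with both ends in the community
    degree_sum = [0] * len(communities)   # total degree of the community's nodes
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            inside[community_of[u]] += 1
    return sum(inner / m - (d / (2 * m)) ** 2 for inner, d in zip(inside, degree_sum))

# Two triangles joined by a single bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

good_split = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
bad_split = modularity(edges, [{"a", "d"}, {"b", "e"}, {"c", "f"}])
```

Splitting along the bridge keeps six of the seven edges inside communities and scores about 0.36. The second split keeps no edges inside any community and scores below zero.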

Important:

It helps in:

 Detecting real world communities


 Finding hidden groups
 Understanding structure of social network
 Studying spread of information

Neighborhood in a Graph:
The neighborhood of a node means all nodes that are directly connected to that node by an edge.
If a node is a person in a social network, its neighborhood is all the friends of that person.

Neighborhood Properties:
Neighborhood properties describe how a node is connected to its neighbors and how the neighbors are
connected among themselves.

These properties help in:

 Analyzing influence of a node


 Finding communities
 Understanding local network structure
Properties:
1. Degree of a Node:

Degree is the number of neighbors a node has.

 In undirected graph: Degree = total edges connected to the node.


 In directed graph:
o In-degree = number of incoming edges
o Out-degree = number of outgoing edges

Example:

If a user has 10 friends, degree = 10.

2. Clustering Coefficient:
Clustering coefficient measures how connected the neighbors of a node are among themselves.

Formula:

\[ \text{Clustering coefficient} = \frac{\text{Number of edges between neighbors}}{\text{Maximum possible edges between neighbors}} \]

Value ranges from 0 to 1.

 1 means all neighbors are fully connected


 0 means neighbors are not connected
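
The clustering coefficient of a single node can be computed as below. With k neighbors, the maximum possible number of edges between them is k(k − 1)/2. The friendship graph is hypothetical, and returning 0 for nodes with fewer than two neighbors is a convention assumed here.

```python
from itertools import combinations

def clustering_coefficient(graph, node):
    """graph maps each node to the set of its neighbors (undirected)."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0                      # fewer than two neighbors: no pairs to connect
    possible = k * (k - 1) / 2          # maximum possible edges between neighbors
    actual = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return actual / possible

graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}

cc_a = clustering_coefficient(graph, "A")
cc_b = clustering_coefficient(graph, "B")
```

Of A's three neighbor pairs, only B and C are connected, so A scores 1/3. B's two neighbors know each other, so B scores 1.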

3. Ego Network:
Ego network of a node includes:

 The node itself (called ego)
 All its neighbors
 All edges between the ego and its neighbors, and among the neighbors themselves

It shows the local structure of a node's connections.

4. Neighborhood Overlap:

Neighborhood overlap measures how many neighbors two nodes share.

Formula:

\[ \text{Overlap} = \frac{\text{Number of common neighbors}}{\text{Number of total unique neighbors}} \]

It helps in detecting communities and strong ties.
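
A sketch of the overlap calculation follows. It assumes the common convention that the two nodes themselves are not counted as neighbors of each other; the graph is hypothetical.

```python
def neighborhood_overlap(graph, a, b):
    """Shared neighbors of a and b divided by all their distinct neighbors."""
    neighbors_a = graph[a] - {b}        # exclude b from a's neighbors (assumed convention)
    neighbors_b = graph[b] - {a}        # exclude a from b's neighbors
    union = neighbors_a | neighbors_b
    if not union:
        return 0.0                      # neither node has any other neighbor
    return len(neighbors_a & neighbors_b) / len(union)

graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "E"},
    "C": {"A", "B"},
    "D": {"A"},
    "E": {"B"},
}

overlap_ab = neighborhood_overlap(graph, "A", "B")
```

A and B share one neighbor (C) out of three distinct neighbors (C, D, E), giving an overlap of 1/3.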


CHAPTER 6 (DATA VISUALIZATION)

Data Visualization:
Data visualization means representing data in a visual form such as charts, graphs, or plots. It is used to
make data easier to understand, to identify patterns, trends, and relationships, and to communicate insights
effectively.

It is the process of showing data visually so that humans can easily understand it.

Important:
Data visualization is important because:

 Humans understand visuals faster than tables of numbers


 Helps identify trends and patterns in large datasets
 Detects anomalies or outliers
 Supports decision making
 Communicates insights to non-technical people

Types:
1. Charts and Graphs:

 Line Chart: Shows trend over time


 Bar Chart: Compares categories
 Pie Chart: Shows proportion of parts in a whole

2. Plots:

 Scatter Plot: Shows relationship between two variables


 Histogram: Shows distribution of a single variable
 Box Plot: Shows spread and outliers

3. Advanced Visualizations:

 Heatmap: Shows intensity of data values using colors


 Network Graph: Shows connections between entities
 Geographical Map: Shows data across regions

Steps in Data Visualization:


1. Collect and clean data
2. Choose the right type of visualization
3. Represent data using charts or plots
4. Add labels, titles, and legends
5. Interpret the results
Principles of Data Visualization:
Data visualization should be designed in a way that makes data easy to understand, interpret, and
communicate.
There are some fundamental principles to follow for effective visualization.

Clarity:

The visualization should clearly show the information without confusion.


Avoid unnecessary colors, shapes, or effects that distract the reader.
Example: A simple bar chart is better than a 3D chart if it shows the same information.

Simplicity:
Keep the design simple and easy to read.
Do not overload the chart with too much information. Highlight only the important points.

Accuracy:

The chart or graph must represent data correctly. Do not distort scales or use misleading visuals.
Example: A truncated Y-axis can exaggerate differences, which is not accurate.

Consistency:
Use consistent colors, scales, and labels across all visualizations.
This helps the audience compare different charts easily.

Proper Labeling:
Always include:

 Axis labels
 Titles
 Legends (if necessary)
 Units of measurement

Labels help the audience understand what is being shown.

Focus on Key Information:

The visualization should highlight important trends, patterns, or comparisons.


Use color, size, or annotations to guide attention to key points.

Appropriate Chart Type:

Choose the right chart or graph for the data type:

 Use bar charts for comparing categories


 Line charts for trends over time
 Pie charts for proportions
 Scatter plots for relationships between two variables
Avoid Misleading Visuals:
Do not manipulate:

 Scales
 Proportions
 Colors

Visualization must reflect the true story of the data.

Good data visualization is clear, simple, accurate, consistent, and focused.


Following these principles ensures that the audience can quickly understand and interpret the data correctly.

Idea of Data Visualization:


The main idea of data visualization is:

 To convert raw data into a visual format such as charts, graphs, or maps.
 To identify patterns, trends, and relationships that are not easy to see in tables.
 To communicate insights effectively to decision-makers or audiences.
 To make large datasets understandable and actionable.

Data visualization helps humans understand data quickly and make informed decisions.

Example:
A sales report with 1,000 rows of data is hard to read.
A line chart showing monthly sales trend makes it easy to understand.

Tools for Data Visualization:


There are many tools used to create visualizations. They can be categorized into programming tools and
software tools.

1. Programming Tools:

These allow flexibility and advanced customization.

 Python Libraries:
o Matplotlib: Basic plotting library for line, bar, scatter, and pie charts.
o Seaborn: Built on Matplotlib; easier and visually better for statistical charts.
o Plotly: Interactive and dynamic charts for dashboards.
o Bokeh: Interactive web visualizations.
 R Libraries:
o ggplot2: Powerful for creating statistical graphics.
o Lattice: Multivariate data visualization.
2. Software Tools:

These are easy to use with minimal coding:

 Microsoft Excel: Simple charts, pivot charts, and dashboards.


 Tableau: Advanced visualization and dashboarding.
 Power BI: Business intelligence tool with interactive dashboards.
 Google Data Studio (now called Looker Studio): Free tool for reports and dashboards.

How to Choose Tool:

 For simple visualization and reports → Excel or Google Data Studio.


 For interactive dashboards and business intelligence → Tableau or Power BI.
 For customized and advanced analytics → Python or R.

CHAPTER 7 (ETHICS IN DATA SCIENCE)

Ethics in Data Science:


Ethics in data science means following moral principles and professional standards when working with data.
It ensures that data collection, analysis, and use are fair, responsible, and safe.
Ethics guides data scientists to use data correctly and avoid harm to people or society.

Example:
Not using personal data without permission.
Avoiding biased models that discriminate against certain groups.

Important:
Ethics is important because:

 Data often contains personal or sensitive information


 Models can affect decisions in healthcare, finance, hiring, or law
 Unethical use of data can harm individuals or society
 Builds trust with users and stakeholders
 Prevents legal and reputational risks

Principles of Ethics:

1. Privacy:

Respect the privacy of individuals whose data is collected.


Ensure personal information is protected and anonymized.

2. Transparency:

Be open about how data is collected, stored, and analyzed.


Explain model decisions in simple terms.
3. Fairness and Bias Prevention:

Ensure models do not discriminate based on gender, race, religion, or other sensitive
attributes.
Check for biased data and correct it.

4. Accountability:

Data scientists should take responsibility for the results of their models and analysis.
Errors or harm caused by models must be addressed.

5. Security:

Protect data from unauthorized access, leaks, or misuse.


Use secure storage and encryption methods.

6. Integrity:

Do not manipulate or misrepresent data to achieve desired results.


Present data and insights honestly.

How to Practice Ethics in Data Science:

 Collect only necessary data


 Anonymize personal information
 Test models for bias
 Document data sources and analysis methods
 Follow legal regulations and organizational policies

Data Science and Ethical Issues:


Data science deals with large amounts of data, including personal and sensitive information.
Because of this, ethical issues can arise in several areas.
Ethics ensures that data scientists act responsibly and protect individuals and society.

The main ethical issues are privacy, security, fairness, and responsibility.

Privacy:

Privacy is about protecting personal data.

 Data scientists must not collect personal information without consent.


 Sensitive data like health records, financial data, or location should be anonymized.
 Users must know how their data is used.

Example:
A company tracking customer location without permission violates privacy.
Security:

Security is about protecting data from unauthorized access or misuse.

 Data must be stored securely using encryption and access controls.


 Prevent hacking, data leaks, or accidental exposure.
 Ensure only authorized personnel can access sensitive information.

Example:
A cloud database with unprotected user data can be stolen, leading to harm.

Ethics:
Ethics in data science ensures responsible, fair, and honest use of data.

 Avoid biased models that discriminate based on race, gender, religion, or age.
 Present findings accurately without manipulation.
 Be accountable for decisions made using data.

Example:
A hiring algorithm rejecting qualified candidates due to biased training data is unethical.

Next Generation Data Scientists:

Next generation data scientists must:

 Understand ethical principles and follow them in all projects.


 Build fair, transparent, and accountable models.
 Protect privacy and security of users.
 Communicate insights honestly and clearly.
 Be aware of the social and legal impact of their work.

Future data scientists should combine technical skills with ethical responsibility to ensure data benefits society
without causing harm.
