Introduction To Data Science – Detailed Notes
Introduction to Data Science
1. What is Data Science?
Data Science is an interdisciplinary field that focuses on extracting meaningful
information, patterns, and knowledge from data. It combines concepts from statistics,
mathematics, computer science, and domain expertise to analyze both structured
and unstructured data.
In simple terms, Data Science is the art and science of turning raw data into useful
insights for decision-making and prediction.
1.1 Definition of Data Science
Data Science is a systematic process of collecting, cleaning, analyzing, modeling, and
interpreting data to solve real-world problems and support informed decision-making.
1.2 Characteristics of Data Science
• Works with large volumes of data (Big Data)
• Uses statistical and machine learning techniques
• Supports prediction and forecasting
• Helps in strategic and operational decisions
• Applicable across multiple domains
1.3 Importance of Data Science
• Organizations generate huge amounts of data every day
• Manual analysis is not practical at this scale
• Data Science helps discover hidden trends and patterns
• Improves accuracy, efficiency, and productivity
• Enables automation and intelligent systems
2. Applications of Data Science
Data Science is widely applied in almost every sector of society. Some important
applications are discussed below.
2.1 Applications of Data Science in Business
In business, Data Science helps organizations improve customer satisfaction, reduce
costs, and increase profits.
Major Applications:
• Customer behavior analysis
• Market basket analysis
• Sales and demand forecasting
• Customer churn prediction
• Recommendation systems
• Fraud detection
• Inventory management
Example: Online shopping platforms like Amazon and Flipkart use Data Science to
recommend products based on customer browsing and purchase history.
2.2 Applications of Data Science in Healthcare
Data Science plays a crucial role in improving healthcare services and patient
outcomes.
Major Applications:
• Disease prediction and diagnosis
• Medical image processing (X-rays, MRI)
• Personalized medicine
• Drug discovery
• Patient monitoring systems
• Hospital resource management
Example: Predicting the risk of heart disease or diabetes using patient health records.
2.3 Applications of Data Science in Finance
Financial institutions heavily rely on Data Science for risk management and fraud
detection.
Major Applications:
• Credit scoring
• Risk assessment
• Fraud detection
• Stock market analysis
• Algorithmic trading
• Loan approval systems
Example: Banks use Data Science to detect unusual credit card transactions and
prevent fraud.
2.4 Other Applications of Data Science
• Education: Student performance analysis
• Transportation: Traffic prediction and route optimization
• Government: Policy making and public welfare analysis
• Sports: Player performance and team strategy
• Social Media: Sentiment analysis and trend detection
3. Data Science Workflow
The Data Science workflow is a structured sequence of steps followed to solve a
data-related problem effectively.
3.1 Data Collection
Data collection is the first step of the Data Science process. It involves gathering raw
data from various sources.
Sources of Data:
• Databases
• Surveys and questionnaires
• Sensors and IoT devices
• Web scraping
• Social media platforms
• Transaction records
Importance:
• Quality of analysis depends on quality of data
• Incomplete or incorrect data leads to poor results
3.2 Data Processing
Data processing converts raw data into a usable format.
Activities Included:
• Data integration
• Data formatting
• Removing errors
• Data transformation
3.3 Data Analysis
Data analysis focuses on understanding the data and discovering patterns.
Techniques Used:
• Descriptive statistics (mean, median, mode)
• Graphical analysis (histograms, bar charts)
• Exploratory Data Analysis (EDA)
Purpose:
• Identify trends and relationships
• Detect anomalies and outliers
3.4 Data Modeling
Data modeling involves applying statistical or machine learning models to data.
Common Models:
• Regression models
• Classification models
• Clustering models
Objective:
• Make predictions
• Classify data
• Discover hidden structures
3.5 Interpretation and Communication
The final step is interpreting results and communicating insights to stakeholders.
Tools Used:
• Reports
• Dashboards
• Data visualization
4. Data Collection and Preprocessing
Data collection and preprocessing are essential steps to ensure data quality.
5. Types of Data
5.1 Structured Data
Structured data is organized in a fixed format, usually in rows and columns.
Examples:
• Relational databases
• Excel sheets
• CSV files
Advantages:
• Easy to store and analyze
• Suitable for statistical analysis
5.2 Unstructured Data
Unstructured data does not have a predefined format.
Examples:
• Text documents
• Images
• Videos
• Emails
• Social media posts
Challenges:
• Difficult to analyze
• Requires advanced techniques
6. Data Formats
6.1 CSV (Comma Separated Values)
• Simple text-based format
• Values separated by commas
• Widely used in Data Science
6.2 JSON (JavaScript Object Notation)
• Data stored as key-value pairs, which can contain nested objects and lists
• Commonly used in APIs and web applications
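As a minimal sketch of how the CSV and JSON formats are read in practice, the following uses Python's standard csv and json modules. The sample records (names, ages, cities) are invented for illustration and are not part of these notes.

```python
import csv
import io
import json

# Hypothetical sample data, invented for illustration only.
csv_text = "name,age,city\nAsha,29,Pune\nRavi,34,Delhi\n"
json_text = '[{"name": "Asha", "age": 29}, {"name": "Ravi", "age": 34}]'

# CSV: each row becomes a dict keyed by the header line; every value is a string.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: key-value pairs become dicts and keep their types (numbers stay numbers).
json_rows = json.loads(json_text)

print(csv_rows[0]["age"], type(csv_rows[0]["age"]).__name__)
print(json_rows[0]["age"], type(json_rows[0]["age"]).__name__)
```

One practical consequence: numeric columns read from CSV usually need an explicit conversion (for example with int or float) before analysis.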
6.3 Excel
• Spreadsheet-based format
• Easy to use
• Suitable for small to medium datasets
7. Data Cleaning
Data cleaning is the process of identifying and correcting errors in data.
7.1 Handling Missing Values
Missing values occur when no value is recorded for a variable in an observation.
Methods:
• Removing rows or columns
• Filling with mean, median, or mode
• Interpolation
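The three methods above can be sketched in plain Python as follows. The readings are hypothetical, None is assumed to mark a missing value, and the interpolate helper is an illustrative name rather than a library function.

```python
import statistics

# Hypothetical readings; None marks a missing value.
values = [10.0, None, 14.0, 16.0, None, 20.0]

# Method 1: remove the missing entries.
observed = [v for v in values if v is not None]

# Method 2: fill each gap with a summary statistic (the mean here).
mean_value = statistics.mean(observed)
mean_filled = [mean_value if v is None else v for v in values]

# Method 3: linear interpolation between the nearest observed neighbours.
def interpolate(series):
    result = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = series[left] + step * (i - left)
    return result  # gaps before the first or after the last observation stay None

interpolated = interpolate(values)
print(mean_filled)
print(interpolated)
```

Mean filling ignores the order of the data, while interpolation assumes neighbouring observations are related, which is usually only reasonable for ordered data such as time series.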
7.2 Handling Duplicate Data
Duplicate records can distort analysis results.
Solution:
• Identify duplicates
• Remove redundant entries
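A minimal sketch of both steps, assuming that a duplicate means a record whose fields all match an earlier record exactly. The customer records are hypothetical.

```python
# Hypothetical records; the third and fifth rows repeat earlier ones.
records = [
    ("C001", "Asha", "Pune"),
    ("C002", "Ravi", "Delhi"),
    ("C001", "Asha", "Pune"),
    ("C003", "Meera", "Chennai"),
    ("C002", "Ravi", "Delhi"),
]

seen = set()
unique_records = []   # first occurrence of each record, in original order
duplicates = []       # every repeated occurrence

for row in records:
    if row in seen:
        duplicates.append(row)
    else:
        seen.add(row)
        unique_records.append(row)

print(len(records), "records,", len(duplicates), "duplicates removed")
```

In real data, duplicates often differ slightly (extra spaces, different letter case), so values may need to be cleaned before an exact comparison like this one finds them.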
7.3 Handling Outliers
Outliers are extreme values that differ significantly from other observations.
Methods:
• Removing outliers
• Replacing with median
• Data transformation
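The notes do not say how to decide which values count as outliers. One common convention, assumed here, is the 1.5 × IQR rule: a value is flagged if it lies more than 1.5 times the interquartile range below Q1 or above Q3. The data is hypothetical.

```python
import statistics

# Hypothetical measurements with one extreme value.
data = [12, 13, 14, 15, 15, 16, 17, 18, 95]

# Quartile cut points; quartile conventions differ slightly between tools.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Assumed rule: anything outside [lower, upper] is treated as an outlier.
outliers = [x for x in data if x < lower or x > upper]

# "Replacing with median": swap each flagged value for the median.
median = statistics.median(data)
cleaned = [median if x in outliers else x for x in data]

print("outliers:", outliers)
print("cleaned:", cleaned)
```

Whether to remove, replace, or keep an outlier depends on whether it is an error or a genuine extreme observation.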
8. Basic Data Preprocessing Techniques
8.1 Data Normalization
Normalization (min-max scaling) rescales data to a fixed range, typically 0 to 1.
8.2 Data Standardization
Standardization rescales data so that it has mean 0 and standard deviation 1.
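As a short sketch of sections 8.1 and 8.2, min-max normalization is commonly computed as (x - min) / (max - min) and standardization as (x - mean) / standard deviation. The sample values are hypothetical, and the population standard deviation is used as an assumption.

```python
import statistics

values = [2.0, 4.0, 6.0, 8.0, 10.0]  # hypothetical feature values

# Normalization: map the smallest value to 0 and the largest to 1.
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# Standardization: subtract the mean, divide by the standard deviation.
mu = statistics.mean(values)
sigma = statistics.pstdev(values)  # population standard deviation (assumed)
standardized = [(v - mu) / sigma for v in values]

print(normalized)
print([round(z, 3) for z in standardized])
```

Normalization is sensitive to extreme values because min and max define the range, whereas standardized values are not bounded to a fixed interval.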
8.3 Encoding Categorical Data
Encoding converts categorical data into numerical form.
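Two widely used encodings are label encoding (one integer per category) and one-hot encoding (one 0/1 column per category). The notes do not name a specific scheme, so both are shown here as illustrative assumptions on hypothetical data.

```python
colors = ["red", "green", "blue", "green"]  # hypothetical categorical column

# Fix a category order so the encoding is reproducible.
categories = sorted(set(colors))  # ['blue', 'green', 'red']

# Label encoding: each category is replaced by its index.
label_encoded = [categories.index(c) for c in colors]

# One-hot encoding: one 0/1 indicator per category.
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

print(label_encoded)
print(one_hot)
```

Label encoding implies an order between categories (blue < green < red here), which can mislead some models; one-hot encoding avoids this at the cost of more columns.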
8.4 Feature Selection
Feature selection chooses the most relevant variables for modeling.
8.5 Data Transformation
Techniques such as log and square root transformation are used to reduce skewness.
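The effect can be checked numerically. The sketch below assumes a moment-based skewness measure (third central moment divided by the 1.5th power of the second) and a hypothetical right-skewed income sample.

```python
import math

def skewness(data):
    """Moment-based skewness: m3 / m2**1.5 (population moments, assumed)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed data: most values are small, one is very large.
incomes = [20, 25, 30, 40, 60, 90, 150, 400]

skew_raw = skewness(incomes)
skew_sqrt = skewness([math.sqrt(x) for x in incomes])
skew_log = skewness([math.log(x) for x in incomes])

print(round(skew_raw, 2), round(skew_sqrt, 2), round(skew_log, 2))
```

Both transformations compress large values more than small ones, which pulls in the long right tail. The log transformation requires strictly positive values.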
Exploratory Data Analysis (EDA)
1. Introduction to Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a fundamental step in the field of Data Science,
Statistics, and Data Analytics. It focuses on exploring datasets to summarize their
main characteristics, often using visual methods. EDA is performed before applying
formal statistical modeling or machine learning algorithms.
EDA helps analysts and researchers understand the nature of data, identify errors,
detect outliers, discover patterns, and check assumptions. It plays a critical role in
improving data quality and ensuring accurate analysis.
Definition of EDA
Exploratory Data Analysis is the process of examining datasets using descriptive
statistics and visualization techniques to understand structure, patterns,
relationships, and anomalies in data.
EDA was popularized by John W. Tukey, who emphasized that data should be explored
visually and numerically before applying confirmatory analysis.
2. Objectives of Exploratory Data Analysis
The main objectives of EDA are:
• To understand the overall structure of the dataset
• To summarize key characteristics of data
• To identify missing values and inconsistencies
• To detect outliers and extreme values
• To study relationships between variables
• To identify trends and patterns
• To prepare data for further statistical analysis or modeling
EDA answers important questions such as:
• What does the data look like?
• Are there any errors in the data?
• How are variables related?
• Is the data suitable for modeling?
3. Importance of EDA in Data Science
EDA is essential for effective data-driven decision-making.
Importance:
• Improves understanding of data
• Reduces risk of incorrect conclusions
• Helps in selecting suitable models
• Detects data quality issues early
• Saves time and resources
• Enhances communication of insights
Without EDA, analysts may apply inappropriate models or misinterpret results.
4. Types of Exploratory Data Analysis
EDA techniques can be broadly classified into the following types:
4.1 Univariate Analysis
• Involves analysis of a single variable
• Focuses on distribution, central tendency, and variability
• Uses statistics like mean, median, and standard deviation
4.2 Bivariate Analysis
• Involves two variables
• Focuses on relationships between variables
• Commonly uses scatter plots and correlation analysis
4.3 Multivariate Analysis
• Involves more than two variables
• Examines complex relationships and interactions
• Uses advanced visualization and statistical techniques
5. Descriptive Statistics
Descriptive statistics are numerical measures used to summarize and describe the
main features of a dataset.
5.1 Measures of Central Tendency
These measures indicate the central or typical value of data.
(a) Mean
• Arithmetic average of all observations
• Formula:
Mean = (Sum of all values) / (Number of values)
• Highly affected by extreme values (outliers)
(b) Median
• Middle value when data is arranged in ascending or descending order (the average of
the two middle values when the number of observations is even)
• Less affected by outliers
(c) Mode
• Most frequently occurring value
• Useful for categorical and discrete data
5.2 Measures of Dispersion
These measures describe the spread or variability of data.
(a) Range
• Difference between maximum and minimum values
(b) Variance
• Average of squared deviations from the mean
• Indicates data variability
(c) Standard Deviation
• Square root of variance
• Shows how much values deviate from the mean
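The measures in 5.1 and 5.2 can be computed with Python's standard statistics module, as in this small sketch on hypothetical data. It uses the population variance (divide by n), which matches the "average of squared deviations" definition above; the sample variance (statistics.variance) divides by n - 1 instead.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

mean = statistics.mean(values)
median = statistics.median(values)       # average of the two middle values here
mode = statistics.mode(values)
data_range = max(values) - min(values)
variance = statistics.pvariance(values)  # population variance: divide by n
std_dev = statistics.pstdev(values)      # square root of the variance

print(mean, median, mode, data_range, variance, std_dev)
```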
5.3 Measures of Shape
These measures describe the shape of data distribution.
(a) Skewness
• Measures asymmetry of data
• Types:
o Positive skewness (right-skewed)
o Negative skewness (left-skewed)
o Zero skewness (symmetrical)
(b) Kurtosis
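• Measures how heavy the tails of a distribution are compared with a normal
distribution (often loosely described as peakedness or flatness)
• Types:
o Leptokurtic (heavy tails, sharp peak)
o Platykurtic (light tails, flat peak)
o Mesokurtic (similar to the normal distribution)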
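Both shape measures can be computed from central moments. The sketch below assumes the population-moment definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3); some textbooks and libraries apply sample-size corrections, so their values may differ slightly. The data is hypothetical.

```python
def central_moment(data, k):
    """k-th central moment: average of (x - mean)**k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)

def skewness(data):
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    # Subtracting 3 makes a normal distribution score 0 (mesokurtic).
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```

A positive skewness indicates a longer right tail and a negative value a longer left tail. Excess kurtosis above 0 is leptokurtic, below 0 is platykurtic.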
6. Data Visualization in EDA
Data visualization is a graphical representation of data that helps identify trends,
patterns, relationships, and outliers.
Benefits of Data Visualization
• Makes complex data easy to understand
• Identifies hidden patterns
• Enhances communication of results
• Supports quick decision-making
7. Histogram
Definition
A histogram is a graphical representation of numerical data where values are grouped
into bins and displayed using bars.
Characteristics of Histogram
• X-axis represents class intervals
• Y-axis represents frequency
• Bars touch each other
Uses of Histogram
• Understand data distribution
• Identify skewness
• Detect outliers
• Compare frequency distributions
Advantages
• Simple and easy to interpret
• Shows overall distribution clearly
Limitations
• Choice of bin width affects interpretation
• Not suitable for categorical data
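To illustrate how values are grouped into bins, and why bin width matters, here is a minimal sketch that counts hypothetical exam marks per class interval. The histogram helper and its parameters are illustrative names, not a library API.

```python
def histogram(values, bin_width, start):
    """Count values per class interval [low, low + bin_width)."""
    counts = {}
    for v in values:
        low = start + int((v - start) // bin_width) * bin_width
        counts[low] = counts.get(low, 0) + 1
    return dict(sorted(counts.items()))  # intervals with no values are omitted

marks = [42, 47, 51, 55, 58, 61, 63, 64, 68, 72, 75, 88]  # hypothetical

narrow = histogram(marks, bin_width=10, start=40)
wide = histogram(marks, bin_width=20, start=40)

for low, count in narrow.items():
    print(f"[{low}, {low + 10}): {'#' * count}")
```

With a width of 10 the peak in the 60s is visible; with a width of 20 the same data collapses into three bars and the detail is lost.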
8. Box Plot
Definition
A box plot (also called a box-and-whisker plot) is a visual tool that summarizes data
using the five-number summary.
Components of Box Plot
• Minimum value
• First Quartile (Q1)
• Median (Q2)
• Third Quartile (Q3)
• Maximum value
Uses of Box Plot
• Identify outliers
• Compare distributions
• Understand data spread
Advantages
• Compact representation
• Easy comparison across datasets
Limitations
• Does not show detailed distribution shape
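The five-number summary behind a box plot can be computed as in this sketch. The data is hypothetical, and the "inclusive" quartile method is one of several conventions, so other tools may report slightly different Q1 and Q3 values.

```python
import statistics

data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]  # hypothetical observations

# Three cut points that split the sorted data into four equal parts.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number = (min(data), q1, median, q3, max(data))
iqr = q3 - q1  # height of the box in a box plot

print("min, Q1, median, Q3, max =", five_number)
print("IQR =", iqr)
```

In many plotting tools the whiskers stop at the most extreme points within 1.5 × IQR of the box, and values beyond that are drawn individually as outliers.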
9. Scatter Plot
Definition
A scatter plot displays the relationship between two numerical variables using points on
a Cartesian plane.
Characteristics
• Each point represents one observation
• X-axis shows independent variable
• Y-axis shows dependent variable
Uses of Scatter Plot
• Identify relationships
• Detect correlation
• Observe trends and clusters
Advantages
• Simple and powerful
• Shows relationship clearly
Limitations
• Points overlap and become hard to read on large datasets unless the data is sampled
or clustered
10. Correlation Analysis
Definition
Correlation analysis measures the strength and direction of the linear relationship
between two variables.
Types of Correlation
• Positive correlation
• Negative correlation
• Zero correlation
Correlation Coefficient
• Pearson’s correlation coefficient (r)
• Range: -1 to +1
Interpretation
• r = +1 → Perfect positive correlation
• r = -1 → Perfect negative correlation
• r = 0 → No linear correlation
Importance of Correlation Analysis
• Identifies related variables
• Helps in feature selection
• Supports prediction models
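Pearson's r is the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The following sketch implements that formula directly on hypothetical data; pearson_r is an illustrative function name.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

hours = [1, 2, 3, 4, 5]
marks_up = [2, 4, 6, 8, 10]    # rises with hours
marks_down = [10, 8, 6, 4, 2]  # falls with hours
curve = [4, 1, 0, 1, 4]        # equals (hours - 3) ** 2

print(pearson_r(hours, marks_up))
print(pearson_r(hours, marks_down))
print(pearson_r(hours, curve))
```

The last case is worth noting: the two variables are perfectly related by a curve, yet r is 0, because r only captures straight-line relationships.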
11. Trend Analysis
Definition
Trend analysis studies data over time to identify long-term patterns or movements.
Types of Trends
• Upward trend
• Downward trend
• Horizontal (no trend)
• Seasonal trend
Applications of Trend Analysis
• Sales forecasting
• Economic analysis
• Financial market prediction
• Demand planning
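Two simple ways to examine a trend are smoothing with a moving average and checking the sign of a least-squares slope over time. Neither technique is prescribed by these notes; both are shown as an illustrative sketch on hypothetical monthly sales, and the function names are made up.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def trend_direction(series, tolerance=0.0):
    """Classify the trend by the slope of a least-squares line over time."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
             / sum((t - t_mean) ** 2 for t in range(n)))
    if slope > tolerance:
        return "upward"
    if slope < -tolerance:
        return "downward"
    return "horizontal"

monthly_sales = [100, 120, 90, 130, 150, 125, 160, 180]  # hypothetical

smoothed = moving_average(monthly_sales, window=3)
print([round(v, 1) for v in smoothed])
print(trend_direction(monthly_sales))
```

The raw series moves up and down from month to month, but the smoothed series rises steadily, which makes the long-term upward movement easier to see.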
12. Role of EDA in Modeling
EDA supports modeling by:
• Identifying important features
• Checking assumptions
• Selecting suitable algorithms
• Improving accuracy and performance
13. Advantages of Exploratory Data Analysis
• Better understanding of data
• Early detection of data issues
• Improved model performance
• Effective visualization and communication
14. Limitations of Exploratory Data Analysis
• Subjective interpretation
• Time-consuming for large datasets
• Cannot replace confirmatory analysis
15. Conclusion
Exploratory Data Analysis is a critical step in data analysis that uses descriptive
statistics and visualization techniques such as histograms, box plots, and scatter plots.
Correlation and trend analysis help identify relationships and patterns in data. Proper
EDA ensures high-quality data, accurate analysis, and reliable decision-making, making
it an essential component of Data Science and Analytics.
1. Introduction to Analytics
Analytics is the systematic use of data, statistical analysis, and computational
techniques to discover patterns, generate insights, and support decision-making. In
today’s digital world, organizations generate massive amounts of data, and analytics
helps convert this raw data into meaningful information.
In simple terms:
Analytics is the science of analyzing data to answer questions, solve problems, and
make better decisions.
Analytics is widely used in business, healthcare, finance, education, manufacturing,
and government sectors.
2. Importance of Analytics
Analytics plays a crucial role in modern organizations.
Importance of Analytics:
• Helps in data-driven decision making
• Improves efficiency and productivity
• Reduces uncertainty and risk
• Identifies trends and opportunities
• Enhances customer satisfaction
• Supports strategic planning
Without analytics, organizations rely only on intuition, which may lead to incorrect
decisions.
3. Types of Analytics
Analytics can be broadly classified into three major types based on the type of
questions they answer: descriptive, predictive, and prescriptive.
4. Descriptive Analytics
Definition
Descriptive analytics focuses on understanding what has happened in the past by
summarizing historical data.
Descriptive analytics answers the question: “What happened?”
Characteristics
• Uses historical data
• Summarizes data using statistics and visualizations
• Provides insights into past performance
Techniques Used
• Mean, median, mode
• Percentages and ratios
• Tables and charts
• Dashboards and reports
Examples
• Monthly sales reports
• Student result analysis
• Website traffic summary
• Profit and loss statements
Advantages
• Simple and easy to understand
• Helps monitor performance
• Foundation for advanced analytics
Limitations
• Does not predict future outcomes
• Does not suggest actions
5. Predictive Analytics
Definition
Predictive analytics focuses on forecasting future outcomes based on historical data
and patterns.
Predictive analytics answers the question: “What is likely to happen?”
Characteristics
• Uses statistical and machine learning models
• Identifies patterns and trends
• Estimates probabilities of future events
Techniques Used
• Regression analysis
• Classification models
• Time series analysis
• Machine learning algorithms
Examples
• Sales forecasting
• Weather prediction
• Credit risk assessment
• Disease prediction
• Customer churn prediction
Advantages
• Helps plan for the future
• Reduces uncertainty
• Improves decision quality
Limitations
• Predictions are probabilistic, not exact
• Depends on quality of historical data
6. Prescriptive Analytics
Definition
Prescriptive analytics focuses on recommending actions based on predictions and
constraints.
Prescriptive analytics answers the question: “What should be done?”
Characteristics
• Combines descriptive and predictive analytics
• Uses optimization and simulation
• Suggests best possible actions
Techniques Used
• Optimization models
• Simulation
• Decision trees
• Machine learning with business rules
Examples
• Route optimization for delivery vehicles
• Inventory management decisions
• Pricing strategies
• Treatment recommendations in healthcare
Advantages
• Provides actionable insights
• Supports automated decision-making
• Maximizes efficiency and outcomes
Limitations
• Complex to implement
• Requires advanced tools and expertise
7. Comparison of Types of Analytics
Type of Analytics     Question Answered        Focus
Descriptive           What happened?           Past
Predictive            What will happen?        Future
Prescriptive          What should be done?     Action
8. Introduction to Machine Learning
Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that enables
computers to learn from data and improve performance without being explicitly
programmed.
Definition of Machine Learning
Machine Learning is a technique that allows systems to automatically learn patterns
from data and make predictions or decisions.
Machine learning is the backbone of modern analytics and intelligent systems.
9. Relationship Between Analytics and Machine Learning
• Analytics focuses on extracting insights from data
• Machine Learning focuses on building models that learn from data
• Predictive and prescriptive analytics heavily use machine learning
• ML automates and improves analytical processes
10. Types of Machine Learning
Machine learning can be broadly classified into:
• Supervised learning
• Unsupervised learning
• Reinforcement learning
In this chapter, we focus on supervised learning.
11. Introduction to Supervised Learning
Definition
Supervised learning is a type of machine learning where models are trained using
labeled data.
In supervised learning, the algorithm learns a mapping between input variables and
known output variables.
Characteristics
• Uses labeled datasets
• Output variable is known
• Goal is prediction or classification
Examples
• Predicting house prices
• Classifying emails as spam or not spam
• Predicting student performance
12. Types of Supervised Learning
Supervised learning problems are mainly divided into:
• Regression
• Classification
13. Regression
Definition
Regression is a supervised learning technique used to predict a continuous numerical
value.
Regression answers the question: “How much?” or “What is the value?”
Examples
• Predicting salary
• Estimating house prices
• Forecasting sales
Types of Regression
• Simple Linear Regression
• Multiple Linear Regression
• Polynomial Regression
Characteristics
• Output is continuous
• Models relationship between variables
• Uses error minimization techniques
Applications
• Business forecasting
• Economic analysis
• Risk estimation
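As a minimal sketch of simple linear regression, the line y = slope × x + intercept can be fitted by least squares, which minimizes the sum of squared errors. The experience and salary figures are hypothetical, and fit_line is an illustrative name.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: years of experience and salary (in thousands).
years_experience = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]

slope, intercept = fit_line(years_experience, salary)
predicted = slope * 6 + intercept  # prediction for 6 years of experience

print(slope, intercept, predicted)
```

Multiple linear regression extends the same idea to several input variables, and polynomial regression adds powers of the input as extra variables.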
14. Classification
Definition
Classification is a supervised learning technique used to predict categorical class
labels.
Classification answers the question: “Which category does it belong to?”
Examples
• Spam vs non-spam email
• Disease diagnosis (positive/negative)
• Customer churn (yes/no)
Types of Classification Algorithms
• Logistic Regression
• Decision Trees
• K-Nearest Neighbors (KNN)
• Support Vector Machines (SVM)
• Naïve Bayes
Characteristics
• Output is categorical
• Assigns class labels
• Often uses probability estimates
Applications
• Fraud detection
• Medical diagnosis
• Sentiment analysis
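Of the algorithms listed above, K-Nearest Neighbors is the simplest to sketch: a new item receives the majority label among its k closest training examples. The features (number of links, number of capitalised words) and the training data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority label among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled emails: (links, capitalised words) -> label.
train = [
    ((1, 0), "not spam"), ((0, 1), "not spam"), ((2, 1), "not spam"),
    ((8, 9), "spam"), ((9, 7), "spam"), ((7, 8), "spam"),
]

print(knn_predict(train, (8, 8)))
print(knn_predict(train, (1, 1)))
```

An odd value of k avoids tied votes when there are two classes. Because KNN relies on distances, features are usually normalized or standardized first.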
15. Comparison Between Regression and Classification
Aspect       Regression          Classification
Output       Continuous          Categorical
Example      Price prediction    Spam detection
Use Case     Forecasting         Decision making
16. Advantages of Supervised Learning
• Can achieve high accuracy when enough good-quality labeled data is available
• Easy to evaluate performance
• Widely used in real-world applications
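Performance is easy to evaluate because predictions can be compared directly with the known labels. Two common measures, chosen here for illustration, are accuracy for classification and mean absolute error for regression. The labels and values are hypothetical.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the known labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def mean_absolute_error(actual, predicted):
    """Average size of the prediction error, in the units of the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Classification: known labels versus model output (hypothetical).
true_labels = ["spam", "not spam", "spam", "spam"]
model_labels = ["spam", "spam", "spam", "not spam"]

# Regression: known prices versus model output (hypothetical).
true_prices = [50, 60, 70]
model_prices = [48, 63, 70]

print(accuracy(true_labels, model_labels))
print(mean_absolute_error(true_prices, model_prices))
```

These measures are normally computed on data that was held back from training, so that they reflect performance on unseen examples.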
17. Limitations of Supervised Learning
• Requires labeled data
• Time-consuming data preparation
• Performance depends on data quality