Data Science Life Cycle

Data Science (DS) is a multidisciplinary field focused on extracting insights from large datasets using various scientific methods and algorithms. The DS life cycle includes critical steps such as problem definition, data collection, exploratory data analysis, and handling missing values, which are essential for effective model development. Proper data preprocessing and visualization techniques are crucial for uncovering patterns and relationships, ultimately aiding in decision-making.

Uploaded by

SuseeRenu

Introduction About Data Science

 DS is a multidisciplinary field that utilizes a wide
range of tools and techniques to extract insights
from data.
 DS is the area of study that involves extracting
insights from vast amounts of data using various
scientific methods, algorithms, and processes.
 Data Science is a multidisciplinary science whose
objective is to analyze data to generate knowledge
that can be used for decision making.
 This knowledge can take the form of recurring
patterns (such as customer buying behavior),
predictive models (such as predicting student
performance), forecasting models (such as sales
forecasting), etc.

 A data science application collects data and
information from multiple heterogeneous sources;
cleans, integrates, processes, and analyzes this data
using various tools; and presents information and
knowledge in various visual forms.

DATA SCIENCE LIFE CYCLE:

 A DS Project involves a series of steps that must be
followed to achieve the desired results.

Problem Definition
 Defining the problem is a critical phase in data
science that sets the direction for the entire project.
 It involves gaining domain knowledge, formulating
a clear problem statement, defining the expected
outcome, defining the scope and constraints,
framing the problem for data-driven analysis and
modeling, understanding data requirements and
availability, conducting a feasibility assessment,
and aligning with stakeholders.
 A well-defined problem provides a solid foundation
for subsequent steps in the data science pipeline,
from data collection and preprocessing to model
development and deployment.
Data Collection
 Identify data sources and collection methods,
which may include databases and data
warehouses, APIs, web scraping, sensor data, social
media, and text and documents.

Data can be either structured or unstructured:

Structured data: Data organized in a tabular format
in which each data point or observation is arranged
into rows and columns.
E.g., relational databases, spreadsheets, and CSV files.

Unstructured data: Data that lacks a predefined
structure or format, making it more challenging to
analyze and process using traditional techniques. E.g.,
text, images, audio, or videos.

Exploratory Data Analysis (EDA) and Data Preprocessing
 EDA involves understanding, visualizing, and
analyzing data to gain insights into its
characteristics and relationships.
 Data preprocessing involves cleaning,
transforming, and organizing raw data into a format
that is suitable for the next phases.
 Proper data analysis and preprocessing can
significantly improve the quality and effectiveness of
the ML models.
 This is an iterative process that ensures the data is
accurate, complete, and in the right format for the
algorithms that will be used in the later phases of
the data science pipeline.

Variable Identification
 As the first step, we need to understand the types of
variables in the dataset. Variables can be broadly
categorized as continuous or categorical.
 Identifying variable types helps to choose
appropriate techniques for analyzing and
visualizing data.

Continuous data: These are numerical variables that
can take on an infinite number of values within a range.
E.g., age, income, or temperature.

Categorical data: These represent categories or labels
and can take on a limited number of distinct values.
E.g., gender, color, or product type.
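
As a minimal sketch of variable identification, the Python snippet below labels each column of a small dataset as continuous or categorical. The records, the column names, and the rule itself (all-numeric columns are continuous, everything else is categorical) are illustrative assumptions rather than part of the original notes. Real datasets often need domain knowledge on top of such a rule, for example when numeric codes actually represent categories.

```python
from numbers import Number

# Hypothetical sample records; column names and values are made up for illustration.
records = [
    {"age": 34, "income": 52000.0, "gender": "F", "product_type": "book"},
    {"age": 45, "income": 61000.5, "gender": "M", "product_type": "toy"},
    {"age": 29, "income": 48000.0, "gender": "F", "product_type": "book"},
]


def identify_variable_types(rows):
    """Label each column 'continuous' if all of its values are numeric, else 'categorical'."""
    types = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        # bool is a subclass of int in Python, so it is excluded explicitly.
        is_numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        types[column] = "continuous" if is_numeric else "categorical"
    return types


variable_types = identify_variable_types(records)
for column, kind in variable_types.items():
    print(f"{column}: {kind}")
```

The resulting labels can then guide the choice of summaries and plots for each variable.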

Understanding Data
 Once the variables are identified, we need to gain a
deeper understanding of the data.
 This involves summarizing the data, examining its
distribution, and removing duplicates.
 Data Summary:
o For continuous data - basic statistics like mean,
median, standard deviation, and quartiles;
o For categorical data - frequency distribution of
the categories.
 Data Distribution:
o Visualizing the distribution of data through
histograms (for continuous data) or bar charts
(for categorical data).
o This helps to identify patterns and skewness.
 Removing Duplicates:
o Detecting and eliminating duplicate records
from further analysis to ensure data quality and
consistency, data accuracy, and unbiased
analysis.
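
The three steps above can be sketched with Python's standard library alone. The rows, the two column meanings (age and product type), and the text-based bar chart are assumptions made for illustration; a real project would typically use a dataframe library and proper plots.

```python
import statistics
from collections import Counter

# Hypothetical records as (age, product_type); the last row repeats the first.
rows = [
    (25, "book"),
    (30, "toy"),
    (35, "book"),
    (40, "game"),
    (25, "book"),
]

# Removing duplicates: keep the first occurrence of each record, preserving order.
unique_rows = list(dict.fromkeys(rows))

ages = [age for age, _ in unique_rows]
products = [product for _, product in unique_rows]

# Data summary for a continuous variable.
summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
    "quartiles": statistics.quantiles(ages, n=4),  # three cut points: Q1, Q2, Q3
}

# Data summary for a categorical variable: frequency distribution.
frequencies = Counter(products)

print(summary)

# Data distribution: a crude text bar chart of the category counts.
for category, count in frequencies.most_common():
    print(f"{category:<5} {'#' * count}")
```

Note that duplicates are removed before the statistics are computed, so the repeated record does not pull the mean or the category counts toward itself.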
Handling Missing Values:
 Missing data is a common issue in real-world
datasets. Missing values can be problematic
because many ML algorithms do not handle them
well. How they should be treated depends on the
context, and several techniques are available.
 Removing rows (data entries) with missing values:
o This is appropriate when the missing data is
minimal and won't significantly impact the
analysis.
 Removing columns (entire features) with missing
values:
o If many data points are missing in a particular
feature or features, it may be better to remove the
entire feature from further analysis.
o However, feature importance analysis and/or
domain expertise must be incorporated in such
cases.
 Imputation:
o Imputing missing values with estimated or
predicted values. Common methods include
mean, median, or mode imputation, and more
sophisticated techniques such as regression
imputation.
 Use appropriate ML techniques:
o Some ML algorithms can effectively handle
missing values (e.g., decision tree-based
algorithms). Therefore, consider the ML
algorithms you intend to use before
removing or imputing missing values at this
phase.
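
The removal and imputation options above can be illustrated with a short Python sketch. The dataset, the use of `None` as the missing-value marker, the 50% threshold for dropping a column, and the helper names are all assumptions chosen for the example; the appropriate threshold and imputation method depend on the context.

```python
import statistics

# Hypothetical dataset; None marks a missing value.
rows = [
    {"age": 25, "income": 40000, "notes": None},
    {"age": None, "income": 52000, "notes": None},
    {"age": 35, "income": None, "notes": "vip"},
    {"age": 40, "income": 61000, "notes": None},
]


def drop_rows_with_missing(data):
    """Keep only rows in which every value is present."""
    return [row for row in data if all(v is not None for v in row.values())]


def drop_sparse_columns(data, max_missing_fraction=0.5):
    """Remove columns whose fraction of missing values exceeds the threshold."""
    keep = [
        column
        for column in data[0]
        if sum(row[column] is None for row in data) / len(data) <= max_missing_fraction
    ]
    return [{column: row[column] for column in keep} for row in data]


def impute_median(data, column):
    """Replace missing values in one column with the median of the observed values."""
    observed = [row[column] for row in data if row[column] is not None]
    fill = statistics.median(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]} for row in data
    ]


# 'notes' is missing in 3 of 4 rows, so it is dropped as a feature.
without_notes = drop_sparse_columns(rows)

# Option A: impute the remaining gaps with the column median.
imputed = impute_median(impute_median(without_notes, "age"), "income")

# Option B: drop the rows that still contain missing values.
complete_only = drop_rows_with_missing(without_notes)

print(imputed)
print(complete_only)
```

Option A keeps all four records, while Option B keeps only two, which shows why row removal is best reserved for cases where little data is missing.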
Visualization and Analysis
Visualizing and analyzing data uncovers patterns and
relationships. This is done using appropriate
visualizations and statistical methods for univariate,
bivariate, and multivariate analysis:
 Univariate Analysis:
o Examining individual variables in isolation.
Tools such as histograms, box plots, bar charts,
and summary statistics are used to understand
each variable's distribution and characteristics.
 Bivariate Analysis:
o Analyzing the relationships between pairs of
variables. Scatter plots, correlation matrices,
and stacked bar charts can reveal how two
variables interact or correlate with each other.
 Multivariate Analysis:
o Exploring interactions among three or more
variables. Techniques like heatmaps, 3D plots,
and dimensionality reduction methods (e.g.,
PCA) help visualize complex relationships.
 Statistical tests, such as t-tests or chi-squared tests,
may be applied to assess the significance of
observed relationships or differences in the data.
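
To make the bivariate measures concrete, the sketch below computes a Pearson correlation coefficient for two continuous variables and a chi-squared statistic for two categorical variables, using plain Python. The study-hours and purchase data are invented for illustration, and the function names are hypothetical. The code returns only the test statistic, not a p-value.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def chi_squared_statistic(table):
    """Chi-squared statistic for a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical continuous pair: hours studied vs. exam score.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 65, 70, 78]
r = pearson_correlation(hours_studied, exam_score)

# Hypothetical categorical pair.
# Rows: two customer groups; columns: bought / did not buy.
purchases = [[30, 20], [20, 30]]
chi2 = chi_squared_statistic(purchases)

print(f"Pearson r = {r:.3f}")
print(f"chi-squared = {chi2:.2f}")
```

For a 2x2 table there is one degree of freedom, so the statistic would be compared against the chi-squared critical value of 3.841 at the 5% significance level.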

Dealing with Outliers

 Detecting and handling outliers is an essential part
of the data science pipeline.
 Common methods for outlier detection include
Z-score analysis, the Interquartile Range (IQR)
method, and visual inspection through box plots or
scatter plots.
 Once the outliers are identified, further
investigations are needed to understand the nature
of the outliers and whether they are genuine
extreme values or data entry errors.
 Outliers are to be treated depending on the context:
they can be removed, transformed, or otherwise
handled based on the nature of the data and the
specific analysis goals. Some examples follow.
 Removing outliers: When outliers are identified
as data entry errors, measurement errors, or
anomalies that are not representative of the
underlying population, it is reasonable to
remove them from the dataset.
 Transforming data:
o To reduce the impact of the outliers while still
retaining the information they contain, the
features can be transformed.
o This technique is effective when outliers skew
the distribution of the data; a suitable
transformation makes the distribution more
symmetric. Examples include logarithmic,
square root, and reciprocal transformations.
 Imputing outliers: We can use data imputation
when we want to retain the data points but reduce
their impact on analysis. Imputing replaces outlier
values with estimated or imputed values based on
the distribution of the data.
 Data capping: We can set a threshold on the data
and replace all values above/below the threshold
with the threshold value itself. Although this
method preserves some information about the
outliers, it can bias the dataset.
 Categorizing outliers: In contrast to all the above
methods, we can preserve the data as it is and
introduce a new categorical variable to flag the
outliers. This approach is effective when outliers
represent a distinct group or have unique
characteristics that are relevant to the analysis.
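
The detection and treatment options in this section can be sketched in a few lines of Python. The measurements, the 1.5 multiplier for the IQR fences (a common convention, not something the notes prescribe), and the choice of the median as the imputed value are assumptions made for the example.

```python
import math
import statistics

# Hypothetical measurements; 95 is an obvious extreme value.
values = [10, 12, 11, 13, 12, 11, 14, 95]


def iqr_bounds(data, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def z_scores(data):
    """Standard scores based on the sample mean and sample standard deviation."""
    mean, stdev = statistics.mean(data), statistics.stdev(data)
    return [(x - mean) / stdev for x in data]


# Detection with the IQR method and with Z-scores.
lower, upper = iqr_bounds(values)
z = z_scores(values)

# Categorizing: keep the data unchanged and add a flag for each point.
is_outlier = [x < lower or x > upper for x in values]

# Removing: drop the flagged points.
removed = [x for x, flag in zip(values, is_outlier) if not flag]

# Capping: clamp values to the fences.
capped = [min(max(x, lower), upper) for x in values]

# Transforming: a log transform compresses large values (requires positive data).
log_transformed = [math.log(x) for x in values]

# Imputing: replace flagged values with the median of the non-flagged values.
median_fill = statistics.median(removed)
imputed = [median_fill if flag else x for x, flag in zip(values, is_outlier)]

print(is_outlier)
print(capped)
print(imputed)
```

In this small sample the IQR fences flag 95, but its Z-score is only about 2.5 because the outlier itself inflates the mean and standard deviation. A fixed Z-score cutoff of 3 would therefore miss it, which is one reason to combine detection methods with visual inspection.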

