Data Science Life Cycle

Data Science (DS) is a multidisciplinary field focused on extracting insights from large datasets using various scientific methods and algorithms. The DS life cycle includes critical steps such as problem definition, data collection, exploratory data analysis, and handling missing values, which are essential for effective model development. Proper data preprocessing and visualization techniques are crucial for uncovering patterns and relationships, ultimately aiding in decision-making.

Uploaded by

SuseeRenu
Copyright
© All Rights Reserved

Introduction to Data Science

 DS is a multidisciplinary field that utilizes a wide
range of tools and techniques to extract insights
from data.
 DS is the area of study that involves extracting
insights from vast amounts of data using various
scientific methods, algorithms, and processes.
 Data Science is a multi-disciplinary science with an
objective to perform data analysis to generate
knowledge that can be used for decision making.
 The knowledge can take the form of recurring
patterns (such as customer buying behavior),
predictive planning models (such as predicting
student performance), forecasting models (such as
sales forecasting), and so on.

 A data science application collects data and
information from multiple heterogeneous sources;
cleans, integrates, processes, and analyzes this data
using various tools; and presents information and
knowledge in various visual forms.

DATA SCIENCE LIFE CYCLE:


 A DS Project involves a series of steps that must be
followed to achieve the desired results.

Problem Definition
 Defining the problem is a critical phase in data
science that sets the direction for the entire project.
 It involves gaining domain knowledge, formulating
a clear problem statement, defining the expected
outcome, defining the scope and constraints,
framing the problem for data-driven analysis and
modeling, understanding data requirements and
availability, conducting a feasibility assessment,
and aligning with stakeholders.
 A well-defined problem provides a solid foundation
for subsequent steps in the data science pipeline,
from data collection and preprocessing to model
development and deployment.
Data Collection
 Identify data sources and collection methods,
which may include databases and data
warehouses, APIs, web scraping, sensor data, social
media, and text and documents.

Data can be either structured or unstructured:

Structured data: Data that is organized in a tabular
format, where each data point or observation is
arranged into rows and columns.
E.g., relational databases, spreadsheets, and CSV files.

Unstructured data: Data that lacks a predefined
structure or format, making it more challenging to
analyze and process using traditional techniques. E.g.,
text, images, audio, or videos.

Exploratory Data Analysis (EDA) and Data Preprocessing
 EDA involves understanding, visualizing, and
analyzing data to gain insights into its
characteristics and relationships.
 Data preprocessing involves cleaning,
transforming, and organizing raw data into a format
that is suitable for the next phases.
 Proper data analysis and preprocessing can
significantly impact the quality and effectiveness of
the ML models.
 This is an iterative process that ensures the data is
accurate, complete, and in the right format for the
algorithms that will be used in the later phases of
the data science pipeline.

Variable Identification
 As the first step, we need to understand the types of
variables in the dataset. Variables can be broadly
categorized as continuous or categorical.
 Identifying variable types helps to choose
appropriate techniques for analyzing and
visualizing data.

Continuous data: These are numerical variables that
can take on an infinite number of values within a range.
E.g., age, income, or temperature.
Categorical data: These represent categories or labels
and can take on a limited number of distinct values.
E.g., gender, color, or product type.
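
As a rough illustration of variable identification, the sketch below labels each column of a small, made-up dataset as continuous or categorical. The rule it applies (a numeric column with more distinct values than a cutoff counts as continuous), the `max_categories` cutoff, and the column names are assumptions chosen for this example, not a standard definition.

```python
# Hypothetical sample records; the column names are illustrative only.
records = [
    {"age": 23.0, "income": 41000.0, "gender": "F", "product_type": "A"},
    {"age": 35.5, "income": 58000.0, "gender": "M", "product_type": "B"},
    {"age": 41.2, "income": 62000.0, "gender": "F", "product_type": "A"},
    {"age": 29.8, "income": 39000.0, "gender": "M", "product_type": "C"},
]


def identify_variable_type(values, max_categories=3):
    """Label one column as 'continuous' or 'categorical' with a simple heuristic."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    # Assumption: a numeric column with more distinct values than the cutoff
    # is treated as continuous; everything else is treated as categorical.
    if numeric and len(set(values)) > max_categories:
        return "continuous"
    return "categorical"


def identify_variables(rows):
    """Return a {column: type} mapping for a list of dict records."""
    columns = rows[0].keys()
    return {
        col: identify_variable_type([row[col] for row in rows]) for col in columns
    }


variable_types = identify_variables(records)
print(variable_types)
```

In practice the cutoff should be chosen with domain knowledge, since a numeric code such as a postal code is categorical even though it has many distinct values.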

Understanding Data
 Once the variables are identified, we need to gain a
deeper understanding of the data.
 This involves summarizing the data, examining its
distribution, and removing duplicates.
 Data Summary:
o For continuous data - basic statistics like mean,
median, standard deviation, and quartiles;
o For categorical data - frequency distribution of
the categories.
 Data Distribution:
o Visualizing the distribution of data through
histograms (for continuous data) or bar charts
(for categorical data).
o This helps to identify patterns and skewness.

 Removing Duplicates:
o Detecting and eliminating duplicate records
from further analysis to ensure data quality and
consistency, data accuracy, and unbiased
analysis.
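
The data summary and duplicate removal steps above can be sketched with Python's standard library alone. The records, field layout, and helper name `remove_duplicates` are hypothetical; a real project would more likely use a dataframe library, which this sketch does not assume.

```python
import statistics
from collections import Counter

# Hypothetical records (name, age, favorite color); one row is an exact duplicate.
rows = [
    ("alice", 30, "red"),
    ("bob", 40, "blue"),
    ("alice", 30, "red"),  # duplicate of the first record
    ("carol", 50, "red"),
    ("dave", 60, "green"),
]


def remove_duplicates(records):
    """Drop exact duplicate records, keeping first occurrences in original order."""
    return list(dict.fromkeys(records))


unique_rows = remove_duplicates(rows)

ages = [age for _, age, _ in unique_rows]
colors = [color for _, _, color in unique_rows]

# Summary statistics for a continuous variable.
age_summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
    "quartiles": statistics.quantiles(ages, n=4),  # three cut points: Q1, Q2, Q3
}

# Frequency distribution for a categorical variable.
color_counts = Counter(colors)

print(age_summary)
print(color_counts.most_common())
```

Duplicates are removed before summarizing so that the repeated record does not weigh on the mean or inflate the category counts.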
Handling Missing Values:
 Missing data is a common issue in real-world
datasets. Missing values are problematic because
many ML algorithms cannot handle them directly.
They should be treated with a technique chosen
according to the context.
 Removing rows (data entries) with missing values:
This is appropriate when the missing data is
minimal and won't significantly impact the
analysis.
 Removing columns (entire features) with missing
values:
o If many data points are missing in a particular
feature, it may be better to remove the entire
feature from further analysis.
o However, feature importance analysis and/or
domain expertise must be incorporated in such
cases.
 Imputation:
o Imputing missing values with estimated or
predicted values. Common methods include
mean, median, mode imputation, or more
sophisticated techniques like regression
imputation.
 Use appropriate ML techniques:
o Some ML algorithms can effectively handle
missing values (e.g., decision tree-based
algorithms). Therefore, consider which ML
algorithms are intended to be used before
removing or imputing missing values at this
phase.
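
A minimal sketch of two of the options above, row removal and simple imputation, is shown here. It assumes missing values are represented as `None`; the dataset and the helper names `drop_rows_with_missing` and `impute` are hypothetical.

```python
import statistics

# Hypothetical dataset; None marks a missing value.
rows = [
    {"age": 25, "city": "Paris"},
    {"age": None, "city": "Rome"},
    {"age": 35, "city": None},
    {"age": 45, "city": "Paris"},
]


def drop_rows_with_missing(records):
    """Keep only the records in which every field is present."""
    return [r for r in records if all(v is not None for v in r.values())]


def impute(records, column, strategy):
    """Return copies of the records with missing `column` values filled in.

    `strategy` is a function (for example statistics.mean, statistics.median,
    or statistics.mode) applied to the observed values of the column.
    """
    observed = [r[column] for r in records if r[column] is not None]
    fill_value = strategy(observed)
    return [
        {**r, column: fill_value if r[column] is None else r[column]}
        for r in records
    ]


complete_rows = drop_rows_with_missing(rows)

imputed = impute(rows, "age", statistics.mean)      # continuous: mean imputation
imputed = impute(imputed, "city", statistics.mode)  # categorical: mode imputation

print(complete_rows)
print(imputed)
```

Both helpers return new lists rather than modifying `rows`, so the original data stays available for comparing strategies.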
Visualization and Analysis
Visualizing and analyzing data uncovers patterns and
relationships. This is done using appropriate
visualizations and statistical methods for univariate,
bivariate, and multivariate analysis:
 Univariate Analysis:
o Examining individual variables in isolation.
Tools such as histograms, box plots, bar charts,
and summary statistics are used to understand
each variable's distribution and characteristics.
 Bivariate Analysis:
o Analyzing the relationships between pairs of
variables. Scatter plots, correlation matrices,
and stacked bar charts can reveal how two
variables interact or correlate with each other.
 Multivariate Analysis:
o Exploring interactions among three or more
variables. Techniques like heatmaps, 3D plots,
and dimensionality reduction methods (e.g.,
PCA) help visualize complex relationships.
 Statistical tests, such as t-tests or chi-squared tests,
may be applied to assess the significance of
observed relationships or differences in the data.
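
As one small example of bivariate analysis, the sketch below computes the Pearson correlation coefficient between two continuous variables. The data values and variable names are invented for illustration, and the coefficient only measures linear association; it is not a significance test.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations from the means (unnormalized covariance).
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return covariance / (spread_x * spread_y)


# Hypothetical paired observations.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 55, 61, 64, 70]

r = pearson_correlation(hours_studied, exam_score)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear relationship; a scatter plot of the same pairs is the usual visual companion.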

Dealing with Outliers


 Detecting and handling outliers is an essential part
of the data science pipeline.
 Common methods for outlier detection include
Z-score analysis, the Interquartile Range (IQR)
method, and visual inspection through box plots or
scatter plots.
 Once the outliers are identified, further
investigations are needed to understand the nature
of the outliers and whether they are genuine
extreme values or data entry errors.
 Outliers should be treated depending on the context;
they can be removed, transformed, or dealt with
based on the nature of the data and the specific
analysis goals. Some examples are as follows:
 Removing outliers: When the outliers are identified
as data entry errors, measurement errors, or
anomalies that are not representative of the
underlying population, they can be removed from
the dataset.
 Transforming data:
o To reduce the impact of the outliers but still
retain the information that they contain, the
features can be transformed.
o This technique is effective when outliers skew
the distribution of the data; the transformation
makes the distribution more symmetric. An
appropriate transformation should be used, e.g.,
logarithmic, square root, or reciprocal
transformations.
 Imputing outliers: We can use data imputation
when we want to retain the data points but reduce
their impact on analysis. Imputing replaces outlier
values with estimated or imputed values based on
the distribution of the data.
 Data capping: We can set a threshold on the data
and replace all values above/below the threshold
with the threshold value itself. Although this
method preserves some information about the
outliers, it distorts the original distribution and can
bias the dataset.
 Categorizing outliers: In contrast to all the above
methods, we can preserve the data as it is and
introduce a new categorical variable to flag the
outliers. This approach is effective when outliers
represent a distinct group or have unique
characteristics that are relevant to the analysis.
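
The detection and treatment options above can be sketched as follows, using only the standard library. The income values are invented, and the multiplier `k=1.5` for the IQR fences and the Z-score cutoff of 2.5 are conventions assumed for this example rather than values given in the text.

```python
import math
import statistics


def iqr_bounds(values, k=1.5):
    """Lower and upper fences from the Interquartile Range (IQR) method."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def z_scores(values):
    """Z-score of each value, using the sample standard deviation."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [(v - mean) / stdev for v in values]


def flag_outliers(values, lower, upper):
    """Categorizing: True where a value falls outside the fences."""
    return [v < lower or v > upper for v in values]


def cap(values, lower, upper):
    """Data capping: replace values beyond a fence with the fence itself."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical incomes (in thousands) with one extreme value.
incomes = [32, 35, 36, 38, 40, 41, 43, 45, 47, 250]

lower, upper = iqr_bounds(incomes)
flags = flag_outliers(incomes, lower, upper)  # keep the data, add a flag column
capped = cap(incomes, lower, upper)           # capped copy of the data

# Assumption: a cutoff of 2.5 is used here; the cutoff is the analyst's choice.
z_flags = [abs(z) > 2.5 for z in z_scores(incomes)]

# Transforming: a log transform compresses large values (positive data only).
log_incomes = [math.log(v) for v in incomes]

print(flags)
print(capped)
```

Whether to remove, cap, transform, or simply flag the extreme value still depends on whether it turns out to be a genuine observation or an error, as discussed above.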
