Introduction to Data Science
DS is a multidisciplinary field that utilizes a wide
range of tools and techniques to extract insights
from data.
DS is the area of study which involves extracting
insights from vast amounts of data using various
scientific methods, algorithms, and processes.
Data Science is a multi-disciplinary science with an
objective to perform data analysis to generate
knowledge that can be used for decision making.
The knowledge can take the form of recurring
patterns (such as customer buying behavior),
predictive models (such as predicting student
performance), or forecasting models (such as
sales forecasting).
A data science application collects data and
information from multiple heterogeneous sources,
cleans, integrates, processes and analyses this data
using various tools and presents information and
knowledge in various visual forms.
DATA SCIENCE LIFE CYCLE:
A DS Project involves a series of steps that must be
followed to achieve the desired results.
Problem Definition
Defining the problem is a critical phase in data
science that sets the direction for the entire project.
It involves gaining domain knowledge, formulating
a clear problem statement, defining the expected
outcome, defining the scope and constraints,
framing the problem for data-driven analysis and
modeling, understanding data requirements and
availability, conducting a feasibility assessment,
and aligning with stakeholders.
A well-defined problem provides a solid foundation
for subsequent steps in the data science pipeline,
from data collection and preprocessing to model
development and deployment.
Data Collection
Identify data sources and collection methods,
which may include databases and data
warehouses, APIs, web scraping, sensor data, social
media, and text and documents.
Data can be either structured or unstructured:
Structured data: Data presented in a tabular format,
where each observation is a row and each attribute
is a column.
E.g., Relational Databases, spreadsheets, and CSV files.
Unstructured data: Data that lacks a predefined
structure or format, making it more challenging to
analyze and process using traditional techniques. E.g.,
text, images, audio, or videos.
Exploratory Data Analysis (EDA) and Data
Preprocessing
The EDA involves understanding, visualizing, and
analyzing data to gain insights into its
characteristics and relationships.
Data preprocessing involves cleaning,
transforming, and organizing raw data into a format
that is suitable for the next phases.
Proper data analysis and preprocessing can
significantly impact the quality and effectiveness of
the ML models.
This is an iterative process that ensures the data is
accurate, complete, and in the right format for the
algorithms that will be used in the later phases of
the data science pipeline.
Variable Identification
As the first step, we need to understand the types of
variables in the dataset. Variables can be broadly
categorized as continuous or categorical.
Identifying variable types helps to choose
appropriate techniques for analyzing and
visualizing data.
Continuous data: These are numerical variables that
can take on an infinite number of values within a range.
E.g., age, income, or temperature.
Categorical data: These represent categories or labels
and can take on a limited number of distinct values.
E.g., gender, color, or product type.
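As a minimal sketch of variable identification, the snippet below labels each
column of a small dataset as continuous or categorical. The records, the
column names, and the helper identify_variable_types are hypothetical, and the
rule it applies (numeric means continuous, anything else means categorical) is
a simplifying assumption: numeric codes that stand for categories would need
to be handled separately.

```python
from numbers import Real


def identify_variable_types(records):
    """Label each column 'continuous' or 'categorical'.

    Assumed heuristic: a column is continuous when all of its non-missing
    values are numeric (booleans excluded); otherwise it is categorical.
    """
    types = {}
    for column in records[0]:
        values = [r[column] for r in records if r[column] is not None]
        is_numeric = all(
            isinstance(v, Real) and not isinstance(v, bool) for v in values
        )
        types[column] = "continuous" if values and is_numeric else "categorical"
    return types


# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 34, "income": 52000.0, "gender": "F", "product_type": "book"},
    {"age": 27, "income": 48500.0, "gender": "M", "product_type": "toy"},
    {"age": 45, "income": None, "gender": "F", "product_type": "book"},
]

variable_types = identify_variable_types(records)
print(variable_types)
```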
Understanding Data
Once the variables are identified, we need to gain a
deeper understanding of the data.
This involves summarizing the data, examining its
distribution, and removing duplicates.
Data Summary:
o For continuous data - basic statistics like mean,
median, standard deviation, and quartiles;
o For categorical data - frequency distribution of
the categories.
Data Distribution:
o Visualizing the distribution of data through
histograms (for continuous data) or bar charts
(for categorical data).
o This helps to identify patterns and skewness.
Removing Duplicates:
o Detecting and eliminating duplicate records
before further analysis to ensure data quality,
consistency, and unbiased analysis.
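The three steps above can be sketched with Python's standard library alone.
The values below are hypothetical toy data, and the variable names are
illustrative; data analysis libraries offer equivalent operations for full
tables.

```python
import statistics
from collections import Counter

# Hypothetical toy data.
ages = [23, 35, 35, 41, 29, 52, 35, 47]                 # continuous
colors = ["red", "blue", "red", "green", "red", "blue"]  # categorical

# Data summary for a continuous variable.
summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "std_dev": statistics.stdev(ages),             # sample standard deviation
    "quartiles": statistics.quantiles(ages, n=4),  # [Q1, Q2, Q3]
}

# Data summary for a categorical variable: frequency of each category.
frequencies = Counter(colors)

# Removing duplicates: keep the first occurrence of each record.
# dict.fromkeys preserves insertion order, so the original order is kept.
records = [("alice", 23), ("bob", 35), ("alice", 23), ("carol", 41)]
unique_records = list(dict.fromkeys(records))

print(summary)
print(frequencies.most_common())
print(unique_records)
```

The frequency counts are also the numbers a bar chart of the categorical
variable would display.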
Handling Missing Values:
Missing data is a common issue in real-world
datasets. Missing values can be problematic for
many ML algorithms, as they may not handle them
well. How missing values are treated depends on
the context; common techniques include:
Removing rows (data entries) with missing values:
This is appropriate when the missing data is
minimal and won't significantly impact the
analysis.
Removing columns (entire features) with missing
values:
o If many data points are missing in a particular
feature (or features), consider removing the
entire feature from further analysis.
o However, feature importance analysis and/or
domain expertise must be incorporated in such
cases.
Imputation:
o Imputing missing values with estimated or
predicted values. Common methods include
mean, median, mode imputation, or more
sophisticated techniques like regression
imputation.
Use appropriate ML techniques:
o Some ML algorithms can handle missing values
directly (e.g., some decision tree-based
algorithms). Therefore, consider the ML
algorithms you intend to use before removing
or imputing missing values at this phase.
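A minimal sketch of two of the techniques above, row removal and simple
imputation, using only the standard library. The rows and the helper names
drop_incomplete_rows and impute_column are hypothetical; None is assumed to
mark a missing value. Regression imputation is not shown.

```python
import statistics

# Hypothetical toy rows; None marks a missing value.
rows = [
    {"age": 25, "city": "A"},
    {"age": None, "city": "B"},
    {"age": 31, "city": None},
    {"age": 40, "city": "B"},
]


def drop_incomplete_rows(rows):
    """Keep only rows in which every field is present."""
    return [row for row in rows if all(v is not None for v in row.values())]


def impute_column(rows, column, strategy):
    """Return new rows with missing values in `column` replaced.

    `strategy` is a function of the observed values, for example
    statistics.mean, statistics.median, or statistics.mode.
    """
    observed = [row[column] for row in rows if row[column] is not None]
    fill_value = strategy(observed)
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]


complete_rows = drop_incomplete_rows(rows)

# Mean imputation for the continuous column, mode for the categorical one.
imputed = impute_column(rows, "age", statistics.mean)
imputed = impute_column(imputed, "city", statistics.mode)

print(complete_rows)
print(imputed)
```

Both helpers return new lists and leave the original rows unchanged, so the
effect of each strategy can be compared before committing to one.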
Visualization and Analysis
Visualizing and analyzing data uncovers patterns and
relationships. This is done using appropriate
visualizations and statistical methods for univariate,
bivariate, and multivariate analysis:
Univariate Analysis:
o Examining individual variables in isolation.
Tools such as histograms, box plots, bar charts,
and summary statistics are used to understand
each variable's distribution and characteristics.
Bivariate Analysis:
o Analyzing the relationships between pairs of
variables. Scatter plots, correlation matrices,
and stacked bar charts can reveal how two
variables interact or correlate with each other.
Multivariate Analysis:
o Exploring interactions among three or more
variables. Techniques like heatmaps, 3D plots,
and dimensionality reduction methods (e.g.,
PCA) help visualize complex relationships.
Statistical tests, such as t-tests or chi-squared tests,
may be applied to assess the significance of
observed relationships or differences in the data.
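As one concrete instance of bivariate analysis, the sketch below computes the
Pearson correlation coefficient between two continuous variables from its
definition. The variable names and values are hypothetical toy data, and the
function pearson_correlation is written out here for illustration rather than
taken from a library.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance).
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return co_deviation / (spread_x * spread_y)


# Hypothetical toy data.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]

r = pearson_correlation(hours_studied, exam_score)
print(f"Pearson r = {r:.3f}")
```

A value near +1 or -1 indicates a strong linear relationship, while a value
near 0 indicates little or no linear relationship. A scatter plot of the same
two variables shows whether a non-linear pattern is present.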
Dealing with Outliers
Detecting and handling outliers is an essential part
of the data science pipeline.
Common methods for outlier detection include Z-
score analysis, the Interquartile Range (IQR)
method, and visual inspection through box plots or
scatter plots.
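The two numerical detection methods can be sketched as follows. The data and
the function names are hypothetical. The Z-score threshold of 3 and the IQR
multiplier of 1.5 are common conventions used here as default assumptions, not
fixed rules.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Values whose absolute Z-score exceeds `threshold`."""
    mean = statistics.mean(values)
    std_dev = statistics.stdev(values)  # sample standard deviation
    if std_dev == 0:
        return []  # all values identical: nothing can be an outlier
    return [v for v in values if abs(v - mean) / std_dev > threshold]


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical toy data with one extreme value.
incomes = [42, 45, 47, 48, 50, 51, 53, 55, 56, 250]

# With only 10 observations no Z-score can reach 3, so a lower
# threshold is used for this small sample.
print(zscore_outliers(incomes, threshold=2.5))
print(iqr_outliers(incomes))
```

The Z-score method depends on the mean and standard deviation, which the
outliers themselves inflate. The IQR method relies on quartiles and is less
affected by extreme values.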
Once the outliers are identified, further
investigations are needed to understand the nature
of the outliers and whether they are genuine
extreme values or data entry errors.
Outliers should be treated according to context:
they can be removed, transformed, or otherwise
handled based on the nature of the data and the
specific analysis goals. Some examples are as follows:
Removing outliers: When the outliers are identified
as data entry errors, measurement errors, or
anomalies that are not representative of the
underlying population, they can be removed from
the dataset.
Transforming data:
o To reduce the impact of the outliers while still
retaining the information they contain, the
features can be transformed.
o This technique is effective when outliers skew
the distribution of the data; a suitable
transformation makes it more symmetric.
Common choices are logarithmic, square root,
and reciprocal transformations.
Imputing outliers: We can use data imputation
when we want to retain the data points but reduce
their impact on analysis. Imputation replaces outlier
values with estimates based on the distribution of
the data (for example, the median).
Data capping: We can set a threshold on the data
and replace all values above/below the threshold
with the threshold value itself. Although this
method preserves some information about the
outliers, it can bias the dataset.
Categorizing outliers: In contrast to all the above
methods, we can preserve the data as it is and
introduce a new categorical variable to flag the
outliers. This approach is effective when outliers
represent a distinct group or have unique
characteristics that are relevant to the analysis.
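Three of the treatments above (transforming, capping, and categorizing) can be
sketched as small functions. The data, the thresholds of 30 and 70, and the
function names are hypothetical; in practice the thresholds would come from
the detection step. The transform uses log(1 + x), a common variant of the
logarithmic transformation that is also defined at zero, and assumes
non-negative values.

```python
import math


def log_transform(values):
    """Compress large values with log(1 + x); assumes values >= 0."""
    return [math.log1p(v) for v in values]


def cap(values, lower, upper):
    """Replace values beyond the thresholds with the threshold itself."""
    return [min(max(v, lower), upper) for v in values]


def flag_outliers(values, lower, upper):
    """Keep every value and add a categorical flag marking outliers."""
    return [
        {"value": v, "is_outlier": v < lower or v > upper} for v in values
    ]


# Hypothetical toy data and thresholds.
incomes = [42, 45, 47, 48, 50, 51, 53, 55, 56, 250]
lower, upper = 30, 70

transformed = log_transform(incomes)
capped = cap(incomes, lower, upper)
flagged = flag_outliers(incomes, lower, upper)

print([round(t, 2) for t in transformed])
print(capped)
print([f for f in flagged if f["is_outlier"]])
```

Capping changes the extreme value to 70 and discards how extreme it was, while
flagging leaves the data intact and records the distinction in a separate
variable.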