Data Science Concepts and Applications
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

b) Define Data source? 2
A data source is the origin from which data is collected. This can be a database, a file (like CSV or Excel), a web API, sensors, or any system that generates or holds data.

c) What is missing values? 3
Missing values refer to the absence of data for a specific variable or observation. In a dataset, they are often represented as NULL, NA, or a blank space and must be handled during data preprocessing.

d) List the visualization libraries in python. 4
Common Python visualization libraries include Matplotlib, Seaborn, Plotly, and Bokeh.

e) List applications of data science. 5
Applications include:
• Recommendation engines (e.g., Netflix, Amazon)
• Fraud and risk detection (banking)
• Medical diagnosis and drug discovery
• Image and speech recognition
• Targeted advertising

Data cleaning? 6
Data cleaning (or data cleansing) is the process of detecting, correcting, or removing corrupt, inaccurate, or irrelevant records from a dataset to improve its quality.

g) Define Hypothesis Testing? 7
Hypothesis testing is a statistical method used to make decisions by testing a claim (or hypothesis) about a population based on sample data. It involves testing a "null hypothesis" against an "alternative hypothesis."

h) What is use of Bubble plot? 8
A bubble plot is used to visualize three dimensions of data. It's a variation of a scatter plot where the X and Y axes represent two variables, and the size of the bubble represents a third variable.

i) Define standard deviation? 9
Standard deviation is a statistic that measures the amount of variation or dispersion of a set of values. A low standard deviation means the values are close to the average (mean), while a high standard deviation means the values are spread out over a wider range.

a) List any two application of Data Science 1
Two major applications are recommendation engines (like on Netflix or Amazon) and fraud detection (used by banks and credit card companies).

b) What is outlier? 2
An outlier is a data point that is significantly different or distant from the other observations in a dataset. It's an observation that lies an abnormal distance from other values.

c) What is missing values? 3
Missing values refer to the absence of data for a specific variable or observation. In a dataset, they are often represented as NULL, NA, or a blank space.

d) Define variance. 4
Variance is a statistical measurement of the spread between numbers in a data set. It quantifies how far each number in the set is from the average (mean) and, by extension, from every other number in the set.

e) What is nominal attribute? 5
A nominal attribute is a qualitative or categorical variable where the categories have no intrinsic order or ranking. Examples include colors (e.g., 'Red', 'Blue', 'Green') or types of fruit (e.g., 'Apple', 'Orange', 'Banana').

f) What is data transformation? 6
Data transformation is the process of converting data from one format or structure into another, more suitable format for analysis or modeling. This includes techniques like scaling, normalization, or encoding.

g) What is one hot coding? 7
One-hot encoding is a data transformation technique used to convert categorical variables into a numerical format. It creates a new binary (0 or 1) column for each unique category in the original variable.

h) What is the use of Bubble plot? 8
A bubble plot is used to visualize three dimensions of data. It's a variation of a scatter plot where the X and Y axes represent two variables, and the size of the bubble represents a third variable.

i) Define Data visualisation. 9
Data visualization is the graphical representation of data and information. It uses visual elements like charts, graphs, and maps to help users easily identify patterns, trends, and outliers in data.

j) Define Standard deviation? 10
Standard deviation is a statistic that measures the amount of variation or dispersion of a set of values. It is the square root of the variance and indicates how much, on average, data points differ from the mean.

a) What do you mean by Primary Data? 1
Primary data is data that is collected firsthand by a researcher specifically for their own research purpose. It is data that has not been collected before, often gathered through surveys, experiments, or direct observation.

b) What do you mean by Data Quality? 2
Data quality is a measure of the data's condition based on its fitness to serve a specific purpose. It is assessed using factors like accuracy, completeness, consistency, and timeliness.
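
The one-hot encoding answer in this section can be illustrated with a small sketch. The helper name `one_hot_encode` and the sample column are assumptions made for illustration (the colour names echo the nominal-attribute example); in practice a library routine would normally be used instead.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one 0/1 column per unique category."""
    # Sort the unique categories so the column order is deterministic.
    categories = sorted(set(values))
    # Each row has a 1 in the column of its own category and 0 elsewhere.
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical categorical column (not taken from any real dataset).
colors = ["Red", "Blue", "Green", "Blue"]
categories, encoded = one_hot_encode(colors)

print(categories)
for value, row in zip(colors, encoded):
    print(value, row)
```

Every encoded row contains exactly one 1, which is what distinguishes one-hot encoding from simply numbering the categories.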
c) Define outlier. 3
An outlier is a data point that differs significantly from other observations in a dataset. It is an observation that lies an abnormally far distance from the other values.

d) Define Interquartile range. 4
The Interquartile Range (IQR) is a measure of statistical dispersion. It is calculated as the difference between the third quartile (Q3, the 75th percentile) and the first quartile (Q1, the 25th percentile).

e) What do you mean by missing values? 5
Missing values refer to the absence of data for a specific variable or observation. In a dataset, they are often represented as NULL, NA, or a blank space and must be handled during data cleaning.

f) What are uses of zip files? 6
Zip files are archives used for data compression and for bundling multiple files and folders into a single file. This reduces the overall file size, making it easier to store and faster to transmit.

g) What do you mean by XML Files data format? 7
XML (eXtensible Markup Language) is a semi-structured data format that uses tags to define elements and their attributes. It is both human-readable and machine-readable and is used for encoding and transporting data.

h) Define data discretization. 8
Data discretization is the process of converting continuous (numerical) data into a finite number of discrete intervals or "bins." This transforms a numerical variable into a categorical one (e.g., converting 'Age' into 'Child', 'Adult', 'Senior').

i) What is tag cloud? 9
A tag cloud (or word cloud) is a data visualization method that displays text data. The size of each word in the cloud is proportional to its frequency or importance in the source text, allowing for quick insights into key themes.

j) What is visual encoding? 10
Visual encoding is the process of mapping data values (both quantitative and qualitative) to visual properties in a chart or graph. These properties include position (X/Y coordinates), size, shape, color, and texture.

a) List the tools for data scientist. 10
Key tools for a data scientist include:
• Programming Languages: Python and R
• Data Libraries: Pandas, NumPy, Scikit-learn, TensorFlow
• Databases: SQL
• Big Data Tech: Apache Spark, Hadoop
• Visualization Tools: Tableau, Power BI, Matplotlib

b) Define statistical data analysis? 11
Statistical data analysis is the process of collecting, cleaning, analyzing, interpreting, and presenting data to discover underlying patterns, trends, and insights. It uses statistical methods to quantify relationships and test hypotheses.

c) What is data cube? 12
A data cube is a multi-dimensional array of data used in data warehousing and OLAP (Online Analytical Processing). It allows data to be modeled and viewed from multiple perspectives or dimensions (e.g., viewing sales by product, region, and time period).

d) Give the purpose of data preprocessing? 13
The purpose of data preprocessing is to transform raw, messy data into a clean, consistent, and usable format. This step is crucial because real-world data is often incomplete, inaccurate, or unstructured. Good preprocessing improves the accuracy and reliability of machine learning models.

e) What is the purpose of data visualization? 14
The purpose of data visualization is to communicate complex data and insights clearly and effectively through graphical representations (like charts and graphs). It helps humans easily identify patterns, trends, and outliers that would be difficult to spot in raw text or tables.

a) Differentiate structured and Unstructured Data. 11
| Feature | Structured Data | Unstructured Data |
| :--- | :--- | :--- |
| Organization | Highly organized, follows a predefined schema. | No predefined format or organization. |
| Storage | Typically stored in relational databases (SQL) in tables. | Stored in data lakes, NoSQL databases, or file systems. |
| Examples | Excel spreadsheets, customer records in a SQL database. | Text in emails, images, audio files, social media posts. |
| Analysis | Easy to analyze using standard query tools (SQL). | Difficult to analyze; requires advanced techniques (NLP, CV). |
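
As a rough illustration of the data discretization answer in this section, the sketch below bins a numeric age into the 'Child', 'Adult', and 'Senior' categories. The cut-off ages (18 and 65), the function name `discretize_age`, and the sample ages are assumptions chosen for the example; the notes do not specify them.

```python
def discretize_age(age):
    """Map a numeric age to a categorical bin (cut-offs are assumed)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age < 18:   # assumed upper bound for 'Child'
        return "Child"
    if age < 65:   # assumed upper bound for 'Adult'
        return "Adult"
    return "Senior"


# Hypothetical ages, including values that sit exactly on the bin edges.
ages = [4, 17, 18, 42, 65, 80]
labels = [discretize_age(age) for age in ages]

for age, label in zip(ages, labels):
    print(age, "->", label)
```

The continuous variable becomes a three-valued categorical one, so any later analysis sees only the bin, not the exact age.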
b) What is inferential statistics? 12
Inferential statistics is a branch of statistics that uses data from a small sample to make inferences, conclusions, or predictions about a larger population. It involves methods like hypothesis testing and confidence intervals.

c) What do you mean by data preprocessing? 13
Data preprocessing is the crucial set of steps taken to clean and transform raw data into a high-quality, usable, and efficient format. It includes tasks like handling missing values, data cleaning, and data transformation, making the data suitable for analysis and machine learning.

d) Define data discretization. 14
Data discretization is the process of converting continuous (numerical) data into a finite number of discrete intervals or "bins." This transforms a numerical variable into a categorical one (e.g., converting 'Age' into 'Child', 'Adult', 'Senior').

e) What is visual encoding? 15
Visual encoding is the process of mapping data values (both quantitative and qualitative) to visual properties in a chart or graph. These properties include position (X/Y coordinates), size, shape, color, and texture.

a) Explain different applications of Data Science. 11
• Recommendation Engines: Used by services like Netflix and Amazon to suggest products or movies based on user behavior.
• Fraud Detection: Used in banking and finance to identify suspicious transactions in real-time.
• Healthcare: Used for medical image analysis (e.g., detecting tumors) and predicting disease outbreaks.
• Targeted Advertising: Used to analyze user data and display relevant advertisements.

b) Explain null and alternate hypothesis. 12
In statistical testing:
• Null Hypothesis ($H_0$): This is the default assumption that there is no effect or no relationship between the variables being tested. It's the statement that the researcher tries to disprove.
• Alternate Hypothesis ($H_1$ or $H_a$): This is the claim that the researcher is trying to prove. It contradicts the null hypothesis and states that there is a significant effect or relationship.

c) What do you mean by Noisy data? Explain any two causes of noisy data. 13
Noisy data refers to meaningless, incorrect, or randomly erroneous values included in a dataset. It obscures the true, underlying patterns.
Two common causes are:
1. Human Data Entry Errors: Typos or mistakes made when data is manually entered.
2. Faulty Collection Instruments: A sensor or device (like a faulty thermometer) that is not calibrated correctly and provides inaccurate readings.

d) What do you mean by data visualization? Give example of any two data visualization libraries. 14
Data visualization is the practice of representing data and information graphically. It uses charts, graphs, and maps to help users easily identify patterns, trends, and outliers.
Two examples of data visualization libraries (for Python) are:
1. Matplotlib
2. Seaborn

e) Explain 3V's of data science. 15
The 3V's are the core characteristics used to define Big Data:
1. Volume: The massive amount and scale of data being generated.
2. Velocity: The high speed at which new data is generated and must be processed (e.g., streaming data).
3. Variety: The different types of data, including structured (tables), semi-structured (JSON, XML), and unstructured (text, images, video).

a) What are the measures of central tendency? Explain any two of them in brief. 15
Measures of central tendency are single values that attempt to describe the "center" or "typical" value of a dataset. The three main measures are Mean, Median, and Mode.
1. Mean: This is the "average" of the data. It's calculated by summing all the values in the dataset and dividing by the total number of values. While very common, the mean is sensitive to outliers (extremely high or low values), which can skew the result.
2. Median: This is the middle value in a dataset that has been sorted in ascending order. If the dataset has an even number of values, the median is the average of the two middle values. The median is "robust" to outliers, making it a better measure of center for skewed data.
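
The claim in the central tendency answer that the mean is sensitive to outliers while the median is robust can be checked with a short sketch using Python's standard `statistics` module. The numbers below are hypothetical values invented for the illustration.

```python
import statistics

# Hypothetical values; 1000 is added as a deliberate extreme outlier.
values = [30, 32, 35, 38, 40]
with_outlier = values + [1000]

mean_before = statistics.mean(values)
median_before = statistics.median(values)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", round(mean_after, 2))
print("median:", median_before, "->", median_after)
```

A single extreme value moves the mean from 35 to roughly 195.83, while the median only shifts from 35 to 36.5.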
c) What is venn diagram? How to create it? Explain with example 17
A Venn diagram is a visual illustration that uses overlapping circles to show all possible logical relationships between two or more sets of items. The overlapping areas (intersections) show the elements that the sets have in common.
How to create it:
1. Identify the Sets: Determine the groups of data you want to compare.
2. Draw Circles: Draw one circle for each set.
3. Overlap Circles: Overlap the circles to represent the intersections between the sets.
4. Populate the Diagram: Place each data element into the appropriate section. Elements unique to one set go in the non-overlapping parts, and elements shared by multiple sets go in the overlapping sections.
Example:
• Set A (Likes Coffee): {Ana, Ben, Chloe}
• Set B (Likes Tea): {Ben, Dave}
You would draw two overlapping circles. Ana and Chloe go in the "Coffee only" part. Dave goes in the "Tea only" part. Ben goes in the overlapping section because he likes both.

b) What are the various types of data available? Give example of each? 16
Data can be broadly classified by its structure:
1. Structured Data: Highly organized data that fits neatly into a predefined format, like rows and columns in a table.
  o Example: A SQL database of customer records (e.g., Name, Age, Email).
2. Unstructured Data: Data that has no predefined format or organization, making it more difficult to analyze.
  o Example: Text in emails or social media posts, images, audio files, and videos.
3. Semi-Structured Data: Data that doesn't fit a rigid table structure but has some organizational properties (like tags or markers) that make it easier to analyze than unstructured data.
  o Example: A JSON file from a web API or an XML document.

a) Explain outlier detection methods in brief. 16
1. Visual Method (Box Plots): A box plot (or box-and-whisker plot) is a powerful visualization for detecting outliers. It displays the data's distribution based on the five-number summary (min, Q1, median, Q3, max). Data points that fall outside the "whiskers" (typically 1.5 times the interquartile range above Q3 or below Q1) are plotted as individual points, making them easy to identify as outliers.
2. Statistical Method (Z-Score): The Z-score measures how many standard deviations a data point is from the mean. A common rule is to consider any data point with an absolute Z-score greater than 3 (i.e., Z > 3 or Z < -3) as an outlier, as it's highly improbable in a normal distribution.

b) Write different data visualization libraries in python. 17
• Matplotlib: This is the foundational library for plotting in Python. It's highly customizable and provides a wide range of static, animated, and interactive plots.
• Seaborn: Built on top of Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics. It's particularly strong at visualizing complex relationships and distributions.
• Plotly: A modern, interactive visualization library. It can create web-based, interactive graphs that allow for user engagement, such as hovering, zooming, and filtering.
• Bokeh: Similar to Plotly, Bokeh is designed for creating interactive visualizations for modern web browsers. It's powerful for building interactive dashboards and data applications.

c) What is data cleaning? Explain any two data cleaning methods. 18
Data cleaning is a core part of data preprocessing. It is the process of identifying, correcting, or removing errors, inconsistencies, and inaccuracies from a dataset to improve its quality.
Two common data cleaning methods are:
1. Handling Missing Values: This addresses data points that are not recorded.
  o Deletion: Rows (or columns) with a high percentage of missing values can be removed entirely.
  o Imputation: Missing values can be "filled in" using a statistical value, such as the mean or median (for numerical data) or the mode (for categorical data).
2. Handling Duplicates: This involves finding and removing duplicate records.
  o Identification: Records that are exact copies of one another are identified.
  o Removal: These duplicate rows are deleted from the dataset because they can artificially skew statistical analysis and model training.

a) Explain data cube aggregation method in context of data reduction. 16
A data cube is a multi-dimensional data structure used in data warehousing (e.g., sales data viewed by Product, Location, and Time). Aggregation is a data reduction technique that involves pre-calculating and storing data summaries at higher, more abstract levels.
For example, instead of storing individual daily sales records (fine-grained data), the cube can be aggregated to store total monthly sales, quarterly sales, or yearly sales. This "rolls up" the data. This reduces the total volume of data that needs to be queried, resulting in much faster analysis and reporting, even though it loses some of the fine-grained detail.
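
The regions of the Venn diagram example in this section (Likes Coffee and Likes Tea) correspond directly to set operations. The sketch below uses Python's built-in `set` type with the names from that example; the variable names are chosen here for illustration.

```python
# The two sets from the Venn diagram example.
likes_coffee = {"Ana", "Ben", "Chloe"}   # Set A
likes_tea = {"Ben", "Dave"}              # Set B

coffee_only = likes_coffee - likes_tea   # non-overlapping part of A
tea_only = likes_tea - likes_coffee      # non-overlapping part of B
both = likes_coffee & likes_tea          # the overlapping section
either = likes_coffee | likes_tea        # everyone inside the diagram

print("Coffee only:", sorted(coffee_only))
print("Tea only:   ", sorted(tea_only))
print("Both:       ", sorted(both))
print("Either:     ", sorted(either))
```

Set difference gives the "only" regions and intersection gives the overlap, matching where Ana, Chloe, Dave, and Ben are placed in the example.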
b) What is mean, mode, median and range for the following list of values: 24, 29, 24, 25, 24, 27, 25, 32, 24 17
First, sort the data (n=9):
24, 24, 24, 24, 25, 25, 27, 29, 32
1. Mean (Average):
(24 + 29 + 24 + 25 + 24 + 27 + 25 + 32 + 24) / 9
= 234 / 9 = 26
2. Mode (Most frequent value):
24 (it appears 4 times)
3. Median (Middle value):
The middle value is the (9+1)/2 = 5th value in the sorted list, which is 25.
4. Range (Max - Min):
32 - 24 = 8

c) Explain any four data visualization tools. 18
(This question likely refers to basic chart types, not software.)
1. Bar Chart: Uses vertical or horizontal bars to compare discrete categories of data. It is excellent for showing "how many" or "how much" for different groups.
2. Line Chart: Connects a series of data points with a line. This is the best tool for visualizing trends over a continuous interval, most commonly time.
3. Pie Chart: A circular chart divided into slices, where each slice represents a proportion or percentage of the whole. It is used to show the part-to-whole composition of a dataset.
4. Scatter Plot: Uses dots on a 2D plane to represent the values of two different numerical variables. It is essential for observing the relationship and correlation between two variables.

a) Explain different data formats in brief. 18
1. CSV (Comma-Separated Values): A simple, plain-text format that stores tabular data. Each line is a data record (row), and each value within the record is separated by a comma. It's widely compatible but lacks data type information.
2. JSON (JavaScript Object Notation): A human-readable, semi-structured format that uses key-value pairs. It's the standard for web APIs and is flexible, supporting nested objects and lists.
3. XML (eXtensible Markup Language): A markup language that defines rules for encoding documents. It uses tags to define elements and their attributes. It's more verbose than JSON but is very structured and often used in enterprise systems.
4. Excel (XLS/XLSX): Microsoft Excel's spreadsheet format. It stores data in cells within worksheets and can include formulas, formatting, and multiple sheets, but it's less ideal for programmatic access.

b) What is data quality? Which factors affect data quality? 19
Data quality is a measure of the condition of data based on its fitness to serve a specific purpose. High-quality data is accurate, complete, and consistent, making it reliable for analysis and decision-making.
Key factors that affect data quality:
1. Accuracy: Is the data correct and true to the source? (e.g., Is a customer's age listed as "150"?)
2. Completeness: Is any data missing? (e.g., Are there blank entries for customer phone numbers?)
3. Consistency: Is the data uniform across all systems? (e.g., Is "California" also stored as "CA" in different tables?)
4. Timeliness: Is the data up-to-date and relevant? (e.g., Using a 10-year-old customer address).
5. Validity: Does the data conform to the defined rules and constraints? (e.g., A date field containing "abc").
6. Uniqueness: Are there duplicate records for the same entity?

c) Write detailed notes on basic data visualization tools? 20
(This question likely refers to basic chart types, not software.)
Basic data visualization charts are graphical tools used to represent data:
1. Bar Chart: Uses rectangular bars (vertical or horizontal) to compare values across different discrete categories. It's excellent for showing "how many" or "how much" for distinct groups.
2. Line Chart: Connects a series of data points with a line. This is the best tool for visualizing trends over a continuous interval, most commonly time.
3. Pie Chart: A circular chart divided into slices, where each slice represents a proportion or percentage of the whole. It's used to show the parts-of-a-whole composition of a single dataset.
4. Scatter Plot: Uses dots on a 2D plane to represent the values of two different numerical variables. It's essential for observing the relationship and correlation (or lack thereof) between two variables.
5. Histogram: Resembles a bar chart, but it's used for numerical data. It groups data into continuous ranges (called "bins") and shows the frequency (count) of data points falling into each bin, revealing the data's underlying distribution.
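
The hand calculation of mean, mode, median, and range for the list 24, 29, 24, 25, 24, 27, 25, 32, 24 in this section can be cross-checked with Python's standard `statistics` module. This is only a verification sketch; the variable names are chosen here for illustration.

```python
import statistics

# The nine values from the mean/mode/median/range question.
values = [24, 29, 24, 25, 24, 27, 25, 32, 24]

sorted_values = sorted(values)
mean_value = statistics.mean(values)        # sum of values / count
mode_value = statistics.mode(values)        # most frequent value
median_value = statistics.median(values)    # middle of the sorted list
value_range = max(values) - min(values)     # max - min

print("sorted:", sorted_values)
print("mean:", mean_value)
print("mode:", mode_value)
print("median:", median_value)
print("range:", value_range)
```

The results agree with the worked answer: mean 26, mode 24, median 25, and range 8.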
b) Explain data cube aggregation method in detail. 19
A data cube is a multi-dimensional data structure used in Online Analytical Processing (OLAP) to analyze data from multiple perspectives.
Aggregation is the core process that makes data cubes fast and efficient. It involves pre-computing and storing summarized data at various levels of granularity.
• Concept: Imagine a "sales" data cube with three dimensions: Time (Day > Month > Year), Product (SKU > Brand > Category), and Location (Store > City > Country).
• Pre-computation: Instead of calculating total sales for "Q1 in the USA" from millions of individual transaction records every time, the cube aggregates this data beforehand. It pre-calculates the sum (or average, count, etc.) for all possible combinations of these dimensions (e.g., total sales per brand per month, total sales per city per year).
• Operations: This aggregation allows for rapid analysis through operations like:
  o Roll-up: Navigating from a fine-grained level to a coarser one (e.g., from City to Country or Day to Month).
  o Drill-down: Navigating from a coarse level to a finer one (e.g., from Year to Quarter).
In summary, data cube aggregation is a method of pre-calculating and storing summarized data to provide near-instantaneous answers to complex analytical queries.

c) Explain any two data transformation technique in detail. 20
1. Normalization (Min-Max Scaling):
  o Purpose: This technique rescales all numerical features in the dataset to a fixed range, typically [0, 1].
  o Formula: X_normalized = (X - X_min) / (X_max - X_min)
  o Use Case: It's very useful for algorithms that are sensitive to the magnitude or scale of features and do not assume a specific data distribution. This includes distance-based algorithms like K-Nearest Neighbors (K-NN) and neural networks.
2. Standardization (Z-score Scaling):
  o Purpose: This technique transforms the data to have a mean of 0 and a standard deviation of 1.
  o Formula: X_standardized = (X - μ) / σ (where μ is the mean and σ is the standard deviation).
  o Use Case: This is the most common scaling technique. It's beneficial for algorithms that are sensitive to feature scale or that work best with centered, roughly Gaussian (normal) data, such as Logistic Regression, Linear Discriminant Analysis (LDA), and Principal Component Analysis (PCA).

b) What do you mean by Data attributes? Explain types of attributes with example. 20
A data attribute (also known as a feature, variable, or field) is a property or characteristic of a data object. For example, if the data object is a "Student," its attributes might be "Name," "Age," and "GPA."
The main types of attributes are:
1. Nominal: Categorical values with no intrinsic order or ranking.
  o Example: Colors (Red, Blue, Green), Zip Codes.
2. Ordinal: Categorical values that have a meaningful order or rank, but the distance between them is not defined.
  o Example: Size (Small, Medium, Large), Education Level (Bachelor's, Master's, PhD).
3. Interval: Numerical values where the difference between them is meaningful, but there is no "true zero."
  o Example: Temperature in Celsius (0°C doesn't mean "no heat"), Calendar Dates.
4. Ratio: Numerical values where the difference is meaningful and there is a "true zero," allowing for ratios.
  o Example: Price ($0 means "no cost"), Age, Weight.

c) How do you visualize geospatial data? Explain in detail. 21
Geospatial data is data that has a geographic component (like latitude, longitude, or a region like a state or country). It is visualized using maps.
Key visualization methods include:
1. Choropleth Map: This is the most common method. It uses different shades of color for predefined geographic regions (like states, countries, or counties) to represent the value of a variable. For example, shading states darker blue based on higher population density.
2. Bubble Map: This map places a circle (a "bubble") over a specific geographic point. The size of the bubble is used to represent a numerical value, making it easy to compare magnitudes at different locations (e.g., showing sales volume by city).
3. Heatmap (Density Map): This visualization uses color intensity to show the concentration or density of data points in an area. It is useful when there are too many individual points to plot, as it shows "hotspots" of activity.
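
The roll-up operation described in the data cube aggregation answer in this section can be sketched in a few lines. The sales records, the city and country names, and the function name `roll_up` are all hypothetical and exist only for this illustration; a real OLAP system would pre-compute and store such summaries rather than build them on demand.

```python
from collections import defaultdict

# Hypothetical fine-grained records: (day, city, country, amount).
sales = [
    ("2024-01-05", "Pune", "India", 120),
    ("2024-01-20", "Pune", "India", 80),
    ("2024-02-03", "Mumbai", "India", 300),
    ("2024-02-11", "Austin", "USA", 150),
    ("2024-02-25", "Austin", "USA", 50),
]


def roll_up(records):
    """Roll up Day -> Month and City -> Country by summing the amounts."""
    totals = defaultdict(int)
    for day, _city, country, amount in records:
        month = day[:7]  # keep only the "YYYY-MM" part of the date
        totals[(month, country)] += amount
    return dict(totals)


monthly_by_country = roll_up(sales)
for key, total in sorted(monthly_by_country.items()):
    print(key, total)
```

Five detailed records shrink to three summary cells, which is the data reduction effect, while the grand total stays the same.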
a) Write a short note on feature extraction. 21
Feature extraction is a dimensionality reduction technique used to create a new, smaller set of features from the original set. Unlike feature selection (which just picks a subset of existing features), feature extraction transforms the data, creating new features that are combinations of the original ones.
The goal is to capture the most relevant information and variance from the original data in fewer dimensions. This helps to:
• Reduce the complexity of models.
• Reduce overfitting and mitigate the "curse of dimensionality".
• Improve model training speed and performance.
A classic example is Principal Component Analysis (PCA), which creates new uncorrelated variables (principal components) that are linear combinations of the original variables, ordered by the amount of variance they capture.

b) Explain Exploratory Data Analysis (EDA) in detail. 22
Exploratory Data Analysis (EDA) is a critical initial step in the data science process. It involves investigating and analyzing a dataset to understand its main characteristics, before applying formal modeling or hypothesis testing. The primary goal is to gain insights, discover patterns, spot anomalies, and check assumptions.
Key activities in EDA include:
1. Understanding Variables: Identifying the data types of each feature (e.g., numerical, categorical, text).
2. Univariate Analysis: Analyzing one variable at a time.
  o Numerical: Using summary statistics (mean, median, standard deviation) and visualizations like histograms or box plots to understand the variable's distribution and identify outliers.
  o Categorical: Using frequency tables and bar charts to understand the counts and proportions of each category.
3. Bivariate Analysis: Analyzing the relationship between two variables.
  o Num-Num: Using scatter plots to check for correlation.
  o Num-Cat: Using bar charts (of means) or box plots to compare the numerical variable across different categories.
  o Cat-Cat: Using stacked bar charts or contingency tables.
4. Multivariate Analysis: Analyzing the relationships between three or more variables, often using tools like correlation heatmaps or pair plots.
5. Handling Missing Data: Identifying which columns have missing values and quantifying the amount, which informs later preprocessing steps.
EDA is an iterative, "detective-work" phase that helps a data scientist build intuition about the data and formulate hypotheses for further testing.

a) What is outlier? State visual types of outliers. 21
(The question text "Star a visual types of outliers" likely means "State ways to visualize outliers".)
An outlier is a data point that differs significantly from other observations in a dataset. It is an observation that lies abnormally far from the other values, either much larger or much smaller.
Outliers can be visually identified using:
1. Box Plots (Box-and-Whisker Plots): This is the most common and direct method. A box plot displays the data's quartiles (the main "box") and its range (the "whiskers"). Any data points that fall outside the whiskers are explicitly plotted as individual dots, clearly identifying them as outliers.
2. Scatter Plots: In a scatter plot, outliers are visible as points that are far removed from the main cluster of data. They "stand alone" from the general pattern or trend shown by the other points.

b) State and explain any three data transformation techniques. 23
Data transformation is the process of converting data from one format or structure into another to make it suitable for modeling.
1. Normalization (Min-Max Scaling): This technique scales numerical data to a fixed range, typically 0 to 1. It is calculated as: (value - min) / (max - min). This is useful for algorithms (like K-Nearest Neighbors) that are sensitive to the magnitude of features but don't assume a specific data distribution.
2. Standardization (Z-score Scaling): This technique transforms data to have a mean of 0 and a standard deviation of 1. It's calculated as: (value - mean) / standard_deviation. This is very common and beneficial for algorithms (like PCA and Logistic Regression) that are sensitive to the scale of features or work best with centered data.
3. Log Transformation: This involves taking the logarithm (e.g., log(x)) of each data point. It is extremely useful for handling data that is highly skewed (has a long tail). This transformation compresses the range of large numbers and can help make the data's distribution more symmetrical and closer to normal.
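
The three data transformation techniques in this section (min-max normalization, z-score standardization, and log transformation) can be sketched with the standard library alone. The function names, the sample data, the use of the population standard deviation, and the choice of base-10 logarithms are all assumptions made for this illustration.

```python
import math
import statistics


def min_max_scale(values):
    """(value - min) / (max - min), mapping the data onto [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("min-max scaling needs at least two distinct values")
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """(value - mean) / standard_deviation, giving mean 0 and std 1."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population std (an assumption here)
    if sigma == 0:
        raise ValueError("standardization needs non-constant data")
    return [(v - mu) / sigma for v in values]


def log_transform(values):
    """Base-10 logarithm of each value; only valid for positive values."""
    return [math.log10(v) for v in values]


# Hypothetical right-skewed data with a long tail.
data = [1, 10, 100, 1000]
scaled = min_max_scale(data)
z_scores = standardize(data)
logged = log_transform(data)

print("min-max:", [round(x, 3) for x in scaled])
print("z-score:", [round(x, 3) for x in z_scores])
print("log10:  ", [round(x, 3) for x in logged])
```

On this sample the log transform turns values spanning three orders of magnitude into evenly spaced ones, which illustrates how it compresses a long tail.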
a) What do you mean by Data transformation? Explain strategies of data transformation. 22
Data transformation is a data preprocessing technique used to convert data from its original format into a format that is more suitable for analysis and modeling.
Common strategies for data transformation include:
1. Normalization (Min-Max Scaling): This scales numerical data to a fixed range, typically 0 to 1. It is useful for algorithms (like K-Nearest Neighbors) that are sensitive to the magnitude of features.
2. Standardization (Z-score Scaling): This transforms data to have a mean of 0 and a standard deviation of 1. It is very common and is generally recommended for algorithms that are sensitive to the scale of features (like PCA).
3. Log Transformation: Applying a logarithm (e.g., log(x)) to data. This is extremely useful for handling data that is highly skewed (has a long tail), as it compresses the range of large numbers.

b) What are the different methods for measuring the data dispersion? 23
Methods for measuring data dispersion (or variability) quantify how "spread out" the values in a dataset are:
1. Range: The simplest measure, calculated as the difference between the maximum and minimum values in the dataset.
2. Interquartile Range (IQR): The range of the middle 50% of the data, calculated as Q3 - Q1. It is robust to outliers.
3. Variance: The average of the squared differences from the Mean. It measures the overall spread but is in squared units.
4. Standard Deviation: The square root of the variance. It is the most common measure and represents, roughly, the typical distance of a data point from the mean, in the same units as the data.
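
The four dispersion measures in this section can be computed with Python's standard `statistics` module. As a sketch, the example reuses the nine values from the mean/median question earlier in these notes. Two choices here are assumptions: the population forms `pvariance` and `pstdev` are used to match "the average of the squared differences", and quartiles come from the default method of `statistics.quantiles`; other quartile conventions can give slightly different Q1 and Q3.

```python
import statistics

# The nine values from the mean/mode/median/range question.
values = [24, 29, 24, 25, 24, 27, 25, 32, 24]

# 1. Range: maximum minus minimum.
value_range = max(values) - min(values)

# 2. IQR: Q3 - Q1, using the module's default quartile method.
q1, _q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# 3. Variance: average squared deviation from the mean (population form).
variance = statistics.pvariance(values)

# 4. Standard deviation: square root of the variance.
std_dev = statistics.pstdev(values)

print("range:", value_range)
print("IQR:", iqr)
print("variance:", round(variance, 3))
print("standard deviation:", round(std_dev, 3))
```

The variance is in squared units, while the standard deviation is back in the units of the original values.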