
Dads302 - Exploratory Data Analysis

The document discusses the role of Data Science across various domains, emphasizing its importance in healthcare, finance, retail, and marketing. It also covers measures of dispersion in statistics, data visualization techniques, feature selection methods, and the concept of factor analysis. Each section provides insights into how data science and statistical methods can enhance decision-making and improve outcomes in different fields.

Uploaded by

dixitshetty49

NAME: TANISHQ SONKER

ROLL NO.: 2414503852


PROGRAM: MASTER OF BUSINESS ADMINISTRATION (MBA)
SEMESTER: III
COURSE CODE & NAME: DADS302- EXPLORATORY DATA ANALYSIS

Assignment Set – 1
Questions

1. What is Data Science? Discuss the role of Data Science in various Domains.

Data Science is an interdisciplinary field that combines statistical methods, mathematical algorithms, and
computational technology to extract meaningful insights and knowledge from structured and unstructured data.
At its core, it is the practice of using data to answer questions, predict future trends, and drive strategic decision-
making.

The field is often visualized as the intersection of three primary pillars: Mathematics/Statistics, Computer
Science/Programming, and Domain Expertise. Without domain expertise, data science is merely a technical
exercise; with it, data science becomes a powerful engine for innovation.

The Core Components of Data Science: -


a. Data Collection & Cleaning: Gathering raw data from various sources (databases, APIs, sensors) and
refining it by removing noise and inconsistencies.
b. Exploratory Data Analysis (EDA): Using visual and statistical tools to understand patterns, detect outliers,
and test hypotheses.
c. Modelling & Machine Learning: Applying algorithms to the data to create predictive models or identify
complex relationships.
d. Communication: Translating technical findings into actionable insights for non-technical stakeholders
through data visualization and storytelling.

The Role of Data Science Across Various Domains: -


Data science is not confined to a single industry; its applications are universal. By leveraging big data,
organizations can shift from reactive decision-making to proactive strategy.

i. Healthcare:
Data science has revolutionized patient care and medical research.
• Predictive Diagnostics: Algorithms analyze medical imaging (X-rays, MRIs) to detect anomalies like
tumors, in some tasks with accuracy comparable to that of human practitioners.
• Drug Discovery: Instead of years of trial and error, data science simulates chemical interactions to speed
up the development of new medications.
• Personalized Medicine: Genomic data analysis allows doctors to tailor treatments to a patient's specific
genetic profile.

ii. Finance and Banking:


In the financial sector, where risk and speed are paramount, data science is essential.
• Fraud Detection: Real-time monitoring of transactions uses machine learning to identify suspicious
patterns and prevent unauthorized activity instantly.
• Algorithmic Trading: High-frequency trading models analyze market data in milliseconds to execute
trades at the most profitable moments.
• Credit Scoring: Beyond traditional history, banks now use alternative data (utility payments, social media
behavior) to assess the creditworthiness of individuals.

iii. E-commerce and Retail:


Retailers use data to optimize the entire customer journey.
• Recommendation Engines: Platforms like Amazon or Netflix use collaborative filtering to suggest
products based on a user's browsing history and the behaviour of similar customers.
• Inventory Management: Predictive analytics forecast demand, ensuring that popular products are stocked
while reducing the waste of overstocking.
• Dynamic Pricing: Algorithms adjust prices in real-time based on competitor pricing, demand surges, and
inventory levels.

iv. Marketing and Social Media:


Data science allows for hyper-targeted communication.
• Sentiment Analysis: Brands analyse social media comments and reviews to gauge public opinion about
their products and adjust their messaging accordingly.
• Churn Prediction: Companies identify customers who are likely to stop using a service (like a
subscription) and offer targeted incentives to retain them.

Summary of Impact: -

| Domain | Primary Application | Key Benefit |
| --- | --- | --- |
| Healthcare | Disease Prediction | Improved Patient Outcomes |
| Finance | Fraud Prevention | Risk Mitigation |
| Retail | Recommender Systems | Increased Sales/Engagement |
| Manufacturing | Predictive Maintenance | Reduced Downtime |
| Logistics | Route Optimization | Cost & Time Efficiency |

In conclusion, Data Science acts as a bridge between raw information and intelligent action. By transforming vast
quantities of data into strategic assets, it enables organizations across all domains to solve complex problems and
stay competitive in an increasingly data-driven world.

2. Explain various measures of dispersion in detail using specific examples.


In descriptive statistics, **measures of dispersion** (or variability) describe how spread out or scattered the data
points are within a dataset. While measures of central tendency like the mean tell us where the "center" of the
data is, dispersion tells us how much the individual values deviate from that center.

To illustrate these concepts, imagine two branches of **DOCA Wellness** in Bangalore. Both have an average
daily sales figure of ₹5,000, but their consistency differs:

Branch A: Sales are always between ₹4,800 and ₹5,200.


Branch B: Sales fluctuate between ₹1,000 and ₹9,000.

Dispersion helps us quantify this difference in risk and reliability.

i. Range: -
The simplest measure of dispersion, the Range, is the difference between the maximum and minimum values in
a dataset.

Formula: Range = Max - Min


Example: In Branch B, if the highest sales day is ₹9,000 and the lowest is ₹1,000, the range is ₹8,000.
Limitation: It is highly sensitive to outliers. A single exceptionally good or bad day can drastically change the
range without reflecting the typical spread.
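
The range calculation and its sensitivity to outliers can be illustrated with a minimal Python sketch. The daily sales figures below are hypothetical values chosen so that both branches average ₹5,000; they are not real data from the example business.

```python
# Hypothetical daily sales (in rupees) for one week at each branch.
# Both lists are illustrative assumptions with a mean of 5,000.
branch_a = [4800, 4950, 5000, 5050, 5100, 4900, 5200]
branch_b = [1000, 3000, 5000, 7000, 9000, 4000, 6000]


def value_range(values):
    """Return max minus min, the simplest measure of spread."""
    return max(values) - min(values)


range_a = value_range(branch_a)
range_b = value_range(branch_b)

# A single exceptional day is enough to inflate the range.
range_a_with_outlier = value_range(branch_a + [12000])

print(range_a, range_b, range_a_with_outlier)
```

Adding one unusual day to Branch A changes its range far more than it changes the typical day, which is the limitation described above.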

ii. Interquartile Range (IQR):


The IQR focuses on the middle 50% of the data, making it far more robust against outliers than the range. It
measures the distance between the 25th percentile (Q1) and the 75th percentile (Q3).

Formula: IQR = Q3 – Q1
Example: If you rank the sales of Branch B for a month and find that Q1 is ₹3,000 and Q3 is ₹7,000, the IQR is ₹4,000.
This tells you where the "bulk" of your business happens.
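
As a rough sketch, the IQR can be computed with Python's standard `statistics.quantiles` function. The sales values are hypothetical, and the "inclusive" quartile method is one of several conventions, so other tools may report slightly different quartiles on the same data.

```python
import statistics

# Hypothetical daily sales for Branch B (illustrative values only).
sales = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000]


def iqr(values):
    """Interquartile range using the 'inclusive' quartile convention."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


iqr_normal = iqr(sales)

# Replace the best day with an extreme outlier: the quartiles do not move.
sales_with_outlier = sales[:-1] + [90000]
iqr_outlier = iqr(sales_with_outlier)

print(iqr_normal, iqr_outlier)
```

Because the quartiles depend only on the middle of the ranked data, the extreme value leaves the IQR unchanged.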
iii. Variance:
Variance measures the average squared deviation of each data point from the mean. Squaring the differences
ensures that positive and negative deviations don't cancel each other out.
Formula: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Example: If Branch A has sales very close to the mean, $(x_i - \mu)$ will be small, resulting in a low variance. Branch
B will have a high variance because many days are far from the average.
Note: Because the units are squared (e.g., "rupees squared"), variance is difficult to interpret intuitively in
business terms.

iv. Standard Deviation:

The Standard Deviation is the square root of the variance. It is the most widely used measure of dispersion because
it is expressed in the same units as the original data.

Formula: $\sigma = \sqrt{\sigma^2}$
Example: If the standard deviation for Branch A is ₹150, sales on a typical day fall within about ₹150
of the ₹5,000 average (roughly two-thirds of days, if sales are approximately normally distributed). If Branch B’s standard deviation is ₹2,500, the business is much more volatile.
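
The relationship between variance and standard deviation can be sketched in a few lines of Python. The sales lists are hypothetical, and the population formulas (dividing by N) are assumed, matching the variance formula given earlier.

```python
import math
import statistics

# Hypothetical daily sales (rupees); both branches average 5,000.
branch_a = [4800, 4950, 5000, 5050, 5100, 4900, 5200]
branch_b = [1000, 3000, 5000, 7000, 9000, 4000, 6000]


def population_variance(values):
    """Average squared deviation from the mean (divides by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


var_a = population_variance(branch_a)
var_b = population_variance(branch_b)

# Standard deviation is the square root of the variance,
# which brings the result back to rupees.
sd_a = math.sqrt(var_a)
sd_b = math.sqrt(var_b)

print(f"Branch A: variance={var_a:.0f}, sd={sd_a:.2f}")
print(f"Branch B: variance={var_b:.0f}, sd={sd_b:.2f}")
```

The standard library's `statistics.pvariance` and `statistics.pstdev` give the same population figures; `statistics.variance` and `statistics.stdev` divide by N - 1 instead and are meant for samples.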

v. Coefficient of Variation (CV):

The CV is a relative measure of dispersion, expressed as a percentage. It is used to compare the spread of two
datasets that have different units or widely different means.

Formula: $CV = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100\%$
Example: If you want to compare the variability of sales in Bangalore (INR) versus a partner clinic in London
(GBP), the CV allows for a direct comparison of which operation is "more stable" relative to its size.
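
A small sketch of this comparison follows. Both sales lists are invented for illustration (one in INR, one in GBP), and the population standard deviation is assumed.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the mean."""
    return statistics.pstdev(values) / statistics.mean(values) * 100


# Hypothetical daily sales in two currencies (illustrative values only).
bangalore_inr = [4800, 4950, 5000, 5050, 5100, 4900, 5200]
london_gbp = [40, 55, 50, 45, 60, 50, 50]

cv_bangalore = coefficient_of_variation(bangalore_inr)
cv_london = coefficient_of_variation(london_gbp)

print(f"Bangalore CV: {cv_bangalore:.2f}%")
print(f"London CV:    {cv_london:.2f}%")
```

In this made-up data the Bangalore figures have a much larger standard deviation in absolute terms, yet a smaller CV, so that branch is the more stable one relative to its size.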

Summary Table of Measures:


| Measure | Best Used For... | Sensitivity to Outliers |
| --- | --- | --- |
| Range | Quick, rough estimate of spread. | Very High |
| IQR | Data with extreme values/skewed data. | Low |
| Variance | Mathematical modelling and volatility. | High |
| Standard Deviation | Standard reporting and risk assessment. | High |
| CV | Comparing variability across different scales. | High |

3. Discuss various techniques used for Data Visualization.

Data visualization is the graphical representation of information and data. By using visual elements like charts,
graphs, and maps, data visualization techniques provide an accessible way to see and understand trends, outliers,
and patterns in data. For an MBA student, mastering these techniques is essential for translating complex analytics
into persuasive business narratives.

i. Comparison Techniques:
Comparison charts are used to compare the magnitude of values across different categories or over a period of
time.
• Bar Charts: The most common tool for comparing discrete categories. Vertical bars are typically used for
time series, while horizontal bars are better for long category labels.
• Grouped/Stacked Bar Charts: These allow for a secondary level of comparison. For example, comparing
sales across different branches of DOCA Wellness while simultaneously showing the breakdown of
services (Grooming vs. Daycare) within each branch.

ii. Composition Techniques:


These techniques show how individual parts make up a whole.
• Pie and Donut Charts: Used to show proportions among a small number of categories.
• Waffle Charts: These use a 10x10 grid to represent percentages, offering better
precision than pie charts for comparing similar values.
• Treemaps: These use nested rectangles to show hierarchies and part-to-whole relationships. They are
excellent for visualizing complex budgets or large product portfolios where area represents value.

iii. Distribution Techniques:


Distribution visualizations help data scientists understand the spread, frequency, and skewness of a dataset.
• Histograms: Used to show the frequency distribution of a continuous variable by grouping data into
"bins." This is vital for understanding things like customer age demographics.
• Box Plots (Box-and-Whisker Plots): These are essential for identifying outliers and understanding the quartiles and
median of a dataset. In a university setting, for example, a box plot could effectively show the
variation in exam scores across different sections.
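
The binning step behind a histogram can be sketched without any plotting library. The ages and the bin width of 10 are hypothetical choices for illustration; a charting tool would draw bars from the same counts.

```python
from collections import Counter

# Hypothetical customer ages (illustrative values only).
ages = [22, 25, 27, 31, 34, 35, 38, 41, 44, 45, 47, 52, 58, 63]
bin_width = 10  # assumed bin size

# Map each age to the lower edge of its bin, e.g. 27 -> 20.
counts = Counter((age // bin_width) * bin_width for age in ages)

# Print a simple text histogram, one row per bin.
for lower in sorted(counts):
    label = f"{lower}-{lower + bin_width - 1}"
    print(f"{label}: {'#' * counts[lower]}")
```

Changing `bin_width` changes the shape the reader sees, which is why bin size is worth experimenting with during exploratory analysis.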

iv. Relationship Techniques:


These techniques explore correlations and dependencies between two or more variables.
• Scatter Plots: The primary tool for identifying relationships. For instance, plotting "Marketing Spend"
against "Revenue" to see if there is a positive correlation.
• Bubble Charts: An extension of the scatter plot where a third dimension is added through the size of the
bubbles.
• Heatmaps: These use colour intensity to represent the magnitude of a variable across two dimensions.
They are frequently used in business to show peak store hours or website "click" activity.

v. Trend and Connection Techniques:


These are used to visualize data that changes over time or across locations.
• Line Charts: The gold standard for time-series analysis. They help in spotting seasonal trends or long-
term growth in business KPIs.
• Choropleth Maps: Thematic maps where areas are shaded in proportion to a statistical variable (e.g., a
map of Bangalore shaded by the density of pet owners).

Technique Selection Guide: -


| Goal | Recommended Technique |
| --- | --- |
| Comparing Categories | Bar Chart, Clustered Bar |
| Showing Trends over Time | Line Chart, Area Chart |
| Analysing Distributions | Histogram, Box Plot |
| Finding Correlations | Scatter Plot, Heatmap |
| Part-to-Whole | Waffle Chart, Treemap |

In Exploratory Data Analysis (EDA), visualization is often the first step to forming a hypothesis. By choosing
the right technique, you ensure that the insights you discover are not only accurate but also easily communicable
to stakeholders.

Assignment Set – 2
Questions

4. What is feature selection? Discuss any two feature selection techniques used to get optimal feature
combinations.

Feature Selection is the process of selecting a subset of the most relevant features (variables, columns, or
predictors) from the original dataset to be used in building a predictive model.

In a real-world dataset, you might have hundreds of columns (e.g., a customer database might have age, income,
shoe size, favourite colour, and zodiac sign). However, not all these features contribute to predicting the target
variable (e.g., "Will they buy a car?"). Some features are irrelevant (zodiac sign), and some are redundant (date
of birth and age provide the same information).
Feature selection offers several benefits:
• Improves Accuracy: Removing misleading features can improve the model's performance.
• Reduces Overfitting: Fewer redundant features mean less opportunity for the model to make decisions
based on noise.
• Reduces Training Time: Fewer columns mean faster computation.

Techniques for Feature Selection: -


There are three main categories of feature selection: Filter, Wrapper, and Embedded methods. Below are detailed
discussions on two of the most widely used techniques:

Technique A: Filter Methods (specifically "Correlation Matrix")


Concept: Filter methods select features based on their statistical scores, independent of any machine learning
algorithm. They act as a "screening" step before the model is even built. The most common filter technique is
analysing the Correlation Matrix.

How it Works:
a. Correlation Check: The algorithm calculates the correlation coefficient (usually Pearson’s r) between
every pair of input variables.
b. Multicollinearity Detection: It looks for features that are highly correlated with each other (e.g., an
absolute correlation above 0.85).
c. Elimination: If two features are highly correlated (e.g., "Monthly Salary" and "Annual Salary"), they
essentially provide the same information to the model. The technique keeps one and drops the other to
remove redundancy.
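
The three steps above can be sketched in plain Python as follows. This is a minimal illustration, not a standard library routine: the function names, the feature values, and the rule of keeping whichever feature appears first are all assumptions made for the example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_filter(features, threshold=0.85):
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson_r(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical data: annual salary is exactly 12 x monthly salary.
features = {
    "monthly_salary": [30, 45, 50, 62, 80],
    "annual_salary": [360, 540, 600, 744, 960],
    "age": [41, 25, 52, 33, 38],
}

selected = correlation_filter(features)
print(selected)
```

Here `annual_salary` is dropped because it duplicates `monthly_salary`, while `age` is kept because it is only weakly correlated with salary in this toy data.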

Pros & Cons:


• Pros: Very fast and computationally cheap. Good for high-dimensional datasets.
• Cons: It scores features without reference to any model, so it can miss interactions among features that matter for prediction.

Technique B: Wrapper Methods (specifically "Recursive Feature Elimination - RFE")


Concept: Wrapper methods are more computationally intensive. They "wrap" a specific machine learning model
around the feature selection process. They treat feature selection as a search problem, evaluating different
combinations of features to see which specific combination produces the best model performance.
How RFE Works:
a. Build Initial Model: The algorithm builds a model (e.g., a Linear Regression or Random Forest) using all
available features.
b. Rank Features: Based on the model's output, it calculates an "importance score" for each feature.
c. Prune: It identifies the least important feature and removes it from the dataset.
d. Repeat (Recurse): It rebuilds the model with the remaining features and repeats the process.
e. Final Selection: This cycle continues until the desired number of features is reached or the model
performance peaks.
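
The loop described in steps a to e can be sketched in plain Python. Several assumptions are made for illustration: the wrapped model is ordinary least squares, the "importance score" is the absolute coefficient on standardized features, the loop stops at a fixed feature count rather than at peak performance, and the data is synthetic (the target is built as 3 × income + age with no noise). All function and feature names are hypothetical.

```python
def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std for v in column]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(columns, y):
    """Least-squares coefficients via the normal equations (inputs are centered)."""
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(p * t for p, t in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)


def rfe(features, target, n_keep):
    """Repeatedly refit the model and drop the feature with the smallest |coefficient|."""
    cols = {name: standardize(vals) for name, vals in features.items()}
    mean_y = sum(target) / len(target)
    y = [t - mean_y for t in target]
    remaining = list(cols)
    while len(remaining) > n_keep:
        coefs = fit_ols([cols[name] for name in remaining], y)
        weakest = min(zip(remaining, coefs), key=lambda pair: abs(pair[1]))[0]
        remaining.remove(weakest)
    return remaining


# Synthetic data: shoe size plays no part in the target.
features = {
    "income": [20, 35, 50, 65, 80, 30, 60, 90],
    "age": [25, 40, 30, 55, 45, 60, 35, 50],
    "shoe_size": [7, 9, 8, 10, 7, 9, 8, 10],
}
target = [3 * i + a for i, a in zip(features["income"], features["age"])]

top_two = rfe(features, target, n_keep=2)
top_one = rfe(features, target, n_keep=1)
print(top_two, top_one)
```

The model is refit after every removal, which is what makes the method recursive and also what makes it expensive on wide datasets. Libraries such as scikit-learn provide a ready-made RFE that can wrap other estimators.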

Pros & Cons:


• Pros: Often more accurate than filter methods because it considers how features interact with the specific model being used.
• Cons: Very slow and computationally expensive, as the model must be retrained dozens or hundreds of
times.

Summary Comparison: -

| Feature | Filter Method (Correlation) | Wrapper Method (RFE) |
| --- | --- | --- |
| Speed | Fast | Slow |
| Model Dependency | Independent (General) | Dependent on a specific model |
| Best For | Removing irrelevant/redundant data | Finding a strong feature combination for a given model |

5. Discuss in detail the concept of Factor Analysis

The Concept of Factor Analysis:


Factor Analysis is a statistical method used to describe variability among observed, correlated variables in terms
of a potentially lower number of unobserved variables called factors.
In simpler terms, it is a data reduction technique. It takes a massive dataset with many variables and reduces them
into a smaller, manageable number of "underlying themes" or "dimensions" without losing much critical
information.

The Core Logic: Imagine you conduct a survey asking people 100 different questions about their shopping habits.
It is difficult to analyse 100 separate answers. However, you might notice that people who answer "Yes" to
Question 5 also answer "Yes" to Questions 10, 15, and 20. Factor analysis identifies these patterns and groups
these related variables into a single Factor (e.g., "Price Sensitivity").

Key Terminology in Factor Analysis:

To understand how Factor Analysis works, one must understand its specific vocabulary:

A. Latent Variables (Factors):


These are the underlying, unobservable constructs. We cannot measure them directly, but we measure them
through observable variables.
Example: "Intelligence" is a latent factor. We cannot measure it directly, so we measure "Math Score," "Verbal
Score," and "Logic Score" (Observed Variables) to infer it.

B. Factor Loading:
This is the correlation coefficient between an observed variable and the factor. It ranges from -1 to 1.
A high loading (e.g., 0.8) means the variable is strongly linked to that factor.
A low loading (e.g., 0.1) means the variable contributes little to that factor.

C. Eigenvalues:
An Eigenvalue represents the amount of variance in the total data that is explained by a single factor.
The Rule of Thumb: usually, only factors with an Eigenvalue greater than 1.0 are retained for analysis (Kaiser
Criterion).
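
As a small worked illustration of the Kaiser Criterion, consider the simplest possible case: two standardized variables with correlation $r$, with factors extracted from the correlation matrix (principal-component extraction is assumed here, and the value $r = 0.6$ is hypothetical).

```latex
R = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix},
\qquad
\det(R - \lambda I) = (1 - \lambda)^2 - r^2 = 0
\;\Longrightarrow\;
\lambda_1 = 1 + r, \quad \lambda_2 = 1 - r

% With r = 0.6:
\lambda_1 = 1.6 \;(> 1,\ \text{retained}), \qquad
\lambda_2 = 0.4 \;(< 1,\ \text{dropped}), \qquad
\lambda_1 + \lambda_2 = 2
```

The eigenvalues of a correlation matrix sum to the number of variables, so the retained factor accounts for 1.6 / 2 = 80% of the total variance in this example.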

D. Factor Rotation:
After the computer extracts factors, the initial loadings are often hard to interpret (variables might load on multiple
factors). "Rotation" (e.g., Varimax Rotation) is a mathematical step that clarifies the picture, maximizing the
distinction between factors so that, ideally, each variable loads clearly on only one factor.

Types of Factor Analysis: -

There are two primary approaches:

i. Exploratory Factor Analysis (EFA):


Used when you do not have a pre-defined idea of the structure or how many dimensions are in a set of variables.
You are "exploring" the data to see what patterns emerge.
Use Case: A marketer launching a new product does a survey to find out what features customers value.

ii. Confirmatory Factor Analysis (CFA):


Used to test a specific hypothesis. You already have a theory (e.g., "I believe there are exactly 3 personality
traits"), and you use CFA to see if the data supports this theory.

Use Case: Validating a psychological test to ensure it accurately measures "Anxiety" and "Depression" as separate
constructs.

Practical Example: Customer Satisfaction:


Imagine a restaurant wants to analyse customer satisfaction. They ask customers to rate 6 items on a scale of 1 to 10:

1. Taste of food
2. Speed of service
3. Cleanliness of plates
4. Friendliness of staff
5. Temperature of food
6. Décor and ambiance

Without Factor Analysis: The manager has to analyse 6 different scores.
With Factor Analysis: The algorithm calculates the correlations and groups them:
Items 1, 3, and 5 are highly correlated → Factor A
Items 2, 4, and 6 are highly correlated → Factor B

Interpretation:
The manager names Factor A "Product Quality."
The manager names Factor B "Service Experience."

Now, instead of tracking 6 messy variables, the manager simply tracks 2 strategic factors.

Importance of Factor Analysis:

• Simplification: It makes complex datasets understandable.


• Multicollinearity Removal: In regression analysis, highly correlated variables can make coefficient estimates unstable. Factor
analysis mitigates this by combining them into a single score.
• Survey Design: It helps researchers remove redundant questions from surveys (if two questions measure
the exact same factor, one can be deleted).

6. Differentiate between Principal Component Analysis and Linear Discriminant Analysis


Both PCA and LDA are linear transformation techniques used for dimensionality reduction (reducing the number
of variables in a dataset), but they operate with fundamentally different goals.
i. The Fundamental Goal (Unsupervised vs. Supervised)
• PCA (Principal Component Analysis):
o Type: It is an Unsupervised Learning algorithm.
o Goal: It ignores class labels (it doesn't care if a data point belongs to "Group A" or "Group B"). Its
only goal is to find the "directions" (Principal Components) that capture the maximum variance
(spread) in the data. It tries to preserve the overall shape of the dataset while compressing it.
• LDA (Linear Discriminant Analysis):
o Type: It is a Supervised Learning algorithm.
o Goal: It explicitly uses class labels. Its goal is to find the axes that maximize the separation
between different classes while minimizing the spread within each class. It tries to make the classes
as distinct as possible.
ii. The Mechanics (Variance vs. Separation):
• PCA: Focuses on the dataset's internal structure. It projects data onto new axes where the global data
variance is highest.
o Analogy: Imagine taking a photo of a flying saucer. PCA tries to find the angle that shows the
saucer's widest shape (biggest size).
• LDA: Focuses on the boundaries between classes. It computes the "Between-Class Variance" and "Within-Class Variance"
and tries to maximize the ratio of the former to the latter.
o Analogy: Imagine taking a photo of a flying saucer and an airplane. LDA tries to find the angle
where the two objects look most distinct from each other, even if that angle makes them look
small.
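
The contrast in the two analogies can be sketched in plain Python on a tiny 2-D dataset. The points are hypothetical and deliberately constructed so that the data is wide along x while the two classes differ along y. The closed-form 2×2 formulas are used only to keep the example dependency-free; they are not how libraries implement PCA or LDA in general.

```python
import math

# Hypothetical 2-D points: spread out along x, but the classes differ along y.
class_a = [(1.0, 1.0), (3.0, 1.2), (5.0, 0.8), (7.0, 1.0)]
class_b = [(1.0, 2.0), (3.0, 2.2), (5.0, 1.8), (7.0, 2.0)]


def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, center):
    """Entries (sxx, sxy, syy) of the 2x2 scatter matrix around center."""
    sxx = sum((p[0] - center[0]) ** 2 for p in points)
    sxy = sum((p[0] - center[0]) * (p[1] - center[1]) for p in points)
    syy = sum((p[1] - center[1]) ** 2 for p in points)
    return sxx, sxy, syy


def pca_direction(points):
    """First principal component: direction of maximum overall variance."""
    sxx, sxy, syy = scatter(points, mean(points))
    # Angle of the major axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))


def lda_direction(a, b):
    """Two-class Fisher discriminant: w proportional to Sw^-1 (mean_b - mean_a)."""
    ma, mb = mean(a), mean(b)
    axx, axy, ayy = scatter(a, ma)
    bxx, bxy, byy = scatter(b, mb)
    sxx, sxy, syy = axx + bxx, axy + bxy, ayy + byy  # within-class scatter
    det = sxx * syy - sxy * sxy
    dx, dy = mb[0] - ma[0], mb[1] - ma[1]
    # Apply the inverse of [[sxx, sxy], [sxy, syy]] to (dx, dy).
    wx = (syy * dx - sxy * dy) / det
    wy = (sxx * dy - sxy * dx) / det
    norm = math.hypot(wx, wy)
    return (wx / norm, wy / norm)


pca_axis = pca_direction(class_a + class_b)  # labels are ignored
lda_axis = lda_direction(class_a, class_b)   # labels are required

print("PCA axis:", pca_axis)
print("LDA axis:", lda_axis)
```

On this data the PCA axis points almost entirely along x, where the spread is largest, while the LDA axis points almost entirely along y, where the classes separate. With two classes, LDA yields only this single axis.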
iii. Component Construction:
• PCA: Computes Principal Components (PC1, PC2...) which are orthogonal (perpendicular) to each other.
The number of components you can find is at most the number of original features.
• LDA: Computes Linear Discriminants (LD1, LD2...). The number of discriminants is limited to $C - 1$,
where $C$ is the number of classes (or to the number of features, if that is smaller). For example, if you are classifying "Cats vs. Dogs" (2 classes), LDA
can produce only 1 axis of separation, regardless of how many features (weight, height, colour) you have.

| Feature | Principal Component Analysis (PCA) | Linear Discriminant Analysis (LDA) |
| --- | --- | --- |
| Learning Type | Unsupervised (Ignores labels) | Supervised (Uses labels) |
| Primary Focus | Maximizing Variance | Maximizing Class Separation |
| Use Case | Data compression, Visualization, Noise reduction | Classification problems |
| Output Limit | At most the number of features | At most (Number of Classes - 1) |
| Best When | You have high-dimensional data with no labels. | You have labeled data and need to build a classifier. |
