0% found this document useful (0 votes)
11 views15 pages

Data Analyst Interview Questions Guide

This document provides a comprehensive guide for data analyst interviews, including over 50 questions categorized into SQL, Python/Pandas, Statistics, Business/Case Studies, and Behavioral questions. Each section contains common interview questions along with detailed answers and explanations to help candidates prepare effectively. The guide emphasizes the importance of understanding concepts and articulating thought processes during interviews.

Uploaded by

nehasonis2424
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views15 pages

Data Analyst Interview Questions Guide

This document provides a comprehensive guide for data analyst interviews, including over 50 questions categorized into SQL, Python/Pandas, Statistics, Business/Case Studies, and Behavioral questions. Each section contains common interview questions along with detailed answers and explanations to help candidates prepare effectively. The guide emphasizes the importance of understanding concepts and articulating thought processes during interviews.

Uploaded by

nehasonis2424
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Data Analyst Interview Questions

SQL + Python + Statistics + Business

50+ questions with answers to crack your next interview

What's Inside:
• SQL Questions (Most asked - 15 questions)

• Python/Pandas Questions (10 questions)

• Statistics Questions (10 questions)

• Business/Case Study Questions (10 questions)

• Behavioral Questions (5 questions)


PART 1: SQL Questions
SQL is asked in 95% of data analyst interviews. Master these.

Q1: What's the difference between WHERE and HAVING?


Answer:

• WHERE filters rows BEFORE grouping (works on raw data)

• HAVING filters AFTER grouping (works on aggregated results)

-- WHERE example: Filter before grouping


SELECT department, AVG(salary)
FROM employees
WHERE status = 'Active' -- filters individual rows first
GROUP BY department;

-- HAVING example: Filter after grouping


SELECT department, AVG(salary)
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000; -- filters grouped results

Q2: Find the second highest salary from Employee table.


Answer: Multiple approaches:

-- Method 1: Subquery
SELECT MAX(salary) FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);

-- Method 2: LIMIT OFFSET (MySQL/PostgreSQL)


SELECT DISTINCT salary FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;

-- Method 3: DENSE_RANK (works for Nth highest)


SELECT salary FROM (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) as rnk
FROM employees
) ranked
WHERE rnk = 2;

Q3: Find duplicate records in a table.


SELECT email, COUNT(*) as count
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;

-- To see all duplicate rows:


SELECT * FROM customers
WHERE email IN (
SELECT email FROM customers
GROUP BY email
HAVING COUNT(*) > 1
);

Q4: What are different types of JOINs?


Answer:

• INNER JOIN - Only matching rows from both tables

• LEFT JOIN - All rows from left + matching from right (NULL if no match)

• RIGHT JOIN - All rows from right + matching from left

• FULL OUTER JOIN - All rows from both (NULL where no match)

• CROSS JOIN - Every row from A with every row from B (Cartesian product)
Q5: Difference between RANK, DENSE_RANK, and ROW_NUMBER?
-- Data: Salaries = 100, 100, 90, 80

-- ROW_NUMBER: 1, 2, 3, 4 (unique numbers, no gaps)


-- RANK: 1, 1, 3, 4 (same value = same rank, gaps after ties)
-- DENSE_RANK: 1, 1, 2, 3 (same value = same rank, no gaps)

SELECT
name, salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) as row_num,
RANK() OVER (ORDER BY salary DESC) as rank,
DENSE_RANK() OVER (ORDER BY salary DESC) as dense_rank
FROM employees;

Q6: Write a query to find employees who earn more than their manager.
SELECT [Link] as employee, [Link] as emp_salary,
[Link] as manager, [Link] as mgr_salary
FROM employees e
JOIN employees m ON e.manager_id = m.employee_id
WHERE [Link] > [Link];

Q7: What's the difference between DELETE, TRUNCATE, and DROP?


• DELETE - Removes specific rows, can use WHERE, can rollback, slower

• TRUNCATE - Removes ALL rows, cannot use WHERE, faster, resets auto-increment

• DROP - Removes entire table structure + data, cannot recover

Q8: Find customers who placed orders in both 2023 AND 2024.
-- Using INTERSECT
SELECT customer_id FROM orders WHERE YEAR(order_date) = 2023
INTERSECT
SELECT customer_id FROM orders WHERE YEAR(order_date) = 2024;

-- Using JOIN
SELECT DISTINCT o1.customer_id
FROM orders o1
JOIN orders o2 ON o1.customer_id = o2.customer_id
WHERE YEAR(o1.order_date) = 2023 AND YEAR(o2.order_date) = 2024;

Q9: Calculate running total of sales.


SELECT
order_date,
amount,
SUM(amount) OVER (ORDER BY order_date) as running_total
FROM sales;

-- Running total by customer


SELECT
customer_id,
order_date,
amount,
SUM(amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
) as customer_running_total
FROM sales;
Q10: Find the top 3 products by sales in each category.
SELECT * FROM (
SELECT
category,
product_name,
total_sales,
DENSE_RANK() OVER (
PARTITION BY category
ORDER BY total_sales DESC
) as rank
FROM products
) ranked
WHERE rank <= 3;
Quick SQL Questions:

Q11: UNION vs UNION ALL?


UNION removes duplicates (slower). UNION ALL keeps all rows (faster).

Q12: What is a Primary Key vs Foreign Key?


Primary Key: Unique identifier for each row, cannot be NULL.

Foreign Key: References primary key of another table, creates relationship.

Q13: What is normalization?


Organizing data to reduce redundancy. 1NF (no repeating groups), 2NF (no partial dependencies), 3NF
(no transitive dependencies).

Q14: What is an index? When to use?


Index speeds up data retrieval (like a book index). Use on columns used in WHERE, JOIN, ORDER BY.
Don't over-index - slows down INSERT/UPDATE.

Q15: NULL vs empty string vs 0?


NULL = unknown/missing value. Empty string = blank text (exists but empty). 0 = actual numeric value.

■ SQL Interview Tip: Always talk through your approach before writing. 'First I'll filter X, then join with Y,
then aggregate by Z.' Shows you can think, not just memorize.
PART 2: Python/Pandas Questions
Q1: How do you handle missing values in pandas?
# Check missing values
[Link]().sum()

# Drop rows with any missing values


[Link]()

# Drop rows where specific columns are missing


[Link](subset=['email', 'phone'])

# Fill missing values


df['salary'].fillna(df['salary'].mean(), inplace=True)
df['category'].fillna('Unknown', inplace=True)

# Forward fill (use previous value)


df['value'].fillna(method='ffill')

Q2: Difference between loc and iloc?


• loc - Label-based selection (uses column names, index labels)

• iloc - Integer-based selection (uses position numbers)

# loc (by label)


[Link][0:5, ['name', 'salary']] # rows 0-5 (inclusive), specific columns

# iloc (by position)


[Link][0:5, 0:3] # rows 0-4 (exclusive end), columns 0-2

Q3: How to merge two DataFrames?


# Inner join (default)
merged = [Link](df1, df2, on='customer_id')

# Left join
merged = [Link](df1, df2, on='customer_id', how='left')

# Join on different column names


merged = [Link](df1, df2, left_on='cust_id', right_on='customer_id')

# Multiple join columns


merged = [Link](df1, df2, on=['customer_id', 'product_id'])

Q4: Explain groupby in pandas.


# Basic groupby
[Link]('category')['sales'].sum()

# Multiple aggregations
[Link]('category')['sales'].agg(['sum', 'mean', 'count'])

# Group by multiple columns


[Link](['region', 'category'])['sales'].sum()

# Different aggregations for different columns


[Link]('category').agg({
'sales': 'sum',
'quantity': 'mean',
'customer_id': 'nunique'
})

Q5: How to create a new column based on conditions?


# Simple condition
df['status'] = [Link](df['score'] >= 50, 'Pass', 'Fail')

# Multiple conditions
df['grade'] = [Link](
[df['score'] >= 90, df['score'] >= 80, df['score'] >= 70],
['A', 'B', 'C'],
default='F'
)

# Using apply with function


def categorize(row):
if row['sales'] > 10000:
return 'High'
elif row['sales'] > 5000:
return 'Medium'
return 'Low'

df['tier'] = [Link](categorize, axis=1)


More Python Questions:

Q6: List vs Tuple vs Set vs Dictionary?


List: Ordered, mutable, allows duplicates [1, 2, 3]

Tuple: Ordered, immutable, allows duplicates (1, 2, 3)

Set: Unordered, mutable, NO duplicates {1, 2, 3}

Dict: Key-value pairs, keys unique {'a': 1, 'b': 2}

Q7: What is a lambda function?


# Lambda = anonymous one-line function
square = lambda x: x**2

# Common use with pandas


df['column'].apply(lambda x: [Link]())
df.sort_values(key=lambda x: [Link]())

Q8: How to remove duplicates in pandas?


df.drop_duplicates() # All columns
df.drop_duplicates(subset=['email']) # Based on email
df.drop_duplicates(keep='last') # Keep last occurrence

Q9: Difference between apply, map, and applymap?


• apply: Works on rows or columns (Series or DataFrame)

• map: Works element-wise on a Series only

• applymap: Works element-wise on entire DataFrame

Q10: How to pivot a DataFrame?


# Pivot table
pivot = df.pivot_table(
values='sales',
index='region',
columns='category',
aggfunc='sum'
)

# Melt (unpivot)
melted = [Link](df, id_vars=['id'], value_vars=['Q1', 'Q2', 'Q3'])
PART 3: Statistics Questions
Q1: Mean vs Median vs Mode - when to use what?
Mean: Average of all values. Use for normally distributed data.

Median: Middle value. Use when data has outliers (salary, house prices).

Mode: Most frequent value. Use for categorical data.

Example: Salaries of 50K, 55K, 60K, 65K, 500K → Mean=146K (misleading), Median=60K (better)

Q2: What is standard deviation?


Measures how spread out data is from the mean. Low SD = data clustered around mean. High SD = data
spread out.

68% of data within 1 SD, 95% within 2 SD, 99.7% within 3 SD (for normal distribution).

Q3: What is correlation? What does correlation of 0.8 mean?


Correlation measures linear relationship between two variables (-1 to +1).

+1 = Perfect positive (both increase together)

-1 = Perfect negative (one increases, other decreases)

0 = No linear relationship

0.8 = Strong positive correlation - when one increases, the other likely increases too.

■■ Correlation ≠ Causation. Ice cream sales and drowning are correlated (both high in summer), but
one doesn't cause the other.

Q4: What is a p-value?


Probability of getting your result if the null hypothesis is true.

p < 0.05: Result is 'statistically significant' - unlikely due to random chance.

p > 0.05: Cannot reject null hypothesis - result might be due to chance.

Q5: Type I vs Type II error?


Type I (False Positive): Rejecting true null hypothesis. Example: Saying drug works when it doesn't.

Type II (False Negative): Failing to reject false null. Example: Saying drug doesn't work when it does.

Q6: What is the Central Limit Theorem?


When you take many samples from ANY distribution, the sample means will be normally distributed.

This is why normal distribution is so important - it shows up everywhere.

Q7: What is A/B testing?


Comparing two versions to see which performs better.

• Split users randomly into control (A) and treatment (B)

• Measure metric (conversion, clicks, revenue)

• Use statistical test to determine if difference is significant


Q8: Explain the difference between population and sample.
Population: Entire group you want to study (all customers).

Sample: Subset you actually measure (1000 surveyed customers).

We use samples because measuring entire population is often impossible.

Q9: What is a confidence interval?


Range where true value likely lies. '95% confidence interval of 45-55' means:

If we repeated this study 100 times, 95 of those intervals would contain the true value.

Q10: Normal distribution vs skewed distribution?


Normal: Symmetric bell curve, mean = median = mode.

Right-skewed: Long tail to right (income, house prices). Mean > Median.

Left-skewed: Long tail to left (age at retirement). Mean < Median.


PART 4: Business/Case Study Questions
Q1: How would you measure the success of a new product feature?
Framework:

1. Define the goal - What problem does this feature solve?

2. Choose metrics - Primary (e.g., conversion rate) + Secondary (e.g., time spent)

3. Set baseline - What's the current performance?

4. Run A/B test - Compare with control group

5. Analyze results - Statistical significance + practical significance

6. Consider tradeoffs - Did improving one metric hurt another?

Q2: Revenue dropped 10% this month. How would you investigate?
1. Verify data quality - Is the data correct? Any logging issues?

2. Segment analysis:

• By region - Is it all regions or specific ones?

• By product - Which products declined?

• By customer type - New vs returning?

• By channel - Web vs mobile vs app?

3. Time analysis - When did it start? Gradual or sudden?

4. External factors - Seasonality? Competitor actions? Economic conditions?

5. Internal changes - Price changes? Marketing spend? Site changes?

Q3: What metrics would you track for an e-commerce business?


Revenue metrics: GMV, Revenue, AOV (Average Order Value)

Conversion funnel: Visitors → Product views → Add to cart → Purchase

Customer metrics: CAC, LTV, Retention rate, Repeat purchase rate

Operational: Cart abandonment rate, Delivery time, Return rate

Q4: How would you explain correlation vs causation to a non-technical person?


'Just because two things happen together doesn't mean one causes the other.'

Example: Ice cream sales and drowning deaths both increase in summer. Ice cream doesn't cause
drowning - hot weather causes both.

To prove causation, you need controlled experiments where you change one thing and keep everything
else the same.

Q5: How do you prioritize what to analyze when you have many possible questions?
1. Business impact - What decisions will this inform?

2. Urgency - Is this blocking something important?


3. Feasibility - Do we have the data? How long will it take?

4. Stakeholder needs - Who asked for this and why?

Focus on high-impact + feasible items first.


PART 5: Behavioral Questions
Q1: Tell me about yourself.
Structure (2 minutes max):

• Present: Current role/what you're doing now

• Past: Relevant experience/education

• Future: Why this role/company

End with why you're excited about THIS opportunity.

Q2: Tell me about a project you're proud of.


Use STAR method:

• Situation: Context and problem

• Task: Your specific responsibility

• Action: What YOU did (be specific)

• Result: Quantified outcome (increased X by Y%)

Q3: How do you handle conflicting priorities?


'I clarify urgency and impact with stakeholders, communicate tradeoffs, and if needed, escalate to get
alignment on priority.'

Q4: Describe a time when your analysis was challenged.


Show you can:

• Listen without being defensive

• Validate concerns by re-checking your work

• Either defend with evidence OR acknowledge the error

• Learn from the experience

Q5: Why data analytics?


Be genuine. Talk about:

• What excites you about working with data

• A specific moment that sparked your interest

• How you see data driving real impact

Final Interview Tips


✓ Think out loud - Interviewers want to see your thought process

✓ Ask clarifying questions - It's expected and shows you're thorough


✓ Admit what you don't know - 'I'm not sure, but I would approach it by...'

✓ Practice SQL on paper - Many interviews don't have a computer

✓ Prepare 2-3 project stories - Know them inside out

✓ Research the company - Know their product, data challenges, recent news

You've got this. Prepare well, stay calm, and show them how you think. Good luck! ■

You might also like