Data Analyst Interview Questions
SQL + Python + Statistics + Business
50+ questions with answers to crack your next interview
What's Inside:
• SQL Questions (Most asked - 15 questions)
• Python/Pandas Questions (10 questions)
• Statistics Questions (10 questions)
• Business/Case Study Questions (10 questions)
• Behavioral Questions (5 questions)
PART 1: SQL Questions
SQL is asked in 95% of data analyst interviews. Master these.
Q1: What's the difference between WHERE and HAVING?
Answer:
• WHERE filters rows BEFORE grouping (works on raw data)
• HAVING filters AFTER grouping (works on aggregated results)
-- WHERE example: Filter before grouping
SELECT department, AVG(salary)
FROM employees
WHERE status = 'Active' -- filters individual rows first
GROUP BY department;
-- HAVING example: Filter after grouping
SELECT department, AVG(salary)
FROM employees
GROUP BY department
HAVING AVG(salary) > 50000; -- filters grouped results
Q2: Find the second highest salary from Employee table.
Answer: Multiple approaches:
-- Method 1: Subquery
SELECT MAX(salary) FROM employees
WHERE salary < (SELECT MAX(salary) FROM employees);
-- Method 2: LIMIT OFFSET (MySQL/PostgreSQL)
SELECT DISTINCT salary FROM employees
ORDER BY salary DESC
LIMIT 1 OFFSET 1;
-- Method 3: DENSE_RANK (works for Nth highest)
SELECT salary FROM (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) as rnk
FROM employees
) ranked
WHERE rnk = 2;
Q3: Find duplicate records in a table.
SELECT email, COUNT(*) as count
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
-- To see all duplicate rows:
SELECT * FROM customers
WHERE email IN (
SELECT email FROM customers
GROUP BY email
HAVING COUNT(*) > 1
);
Q4: What are different types of JOINs?
Answer:
• INNER JOIN - Only matching rows from both tables
• LEFT JOIN - All rows from left + matching from right (NULL if no match)
• RIGHT JOIN - All rows from right + matching from left
• FULL OUTER JOIN - All rows from both (NULL where no match)
• CROSS JOIN - Every row from A with every row from B (Cartesian product)
Q5: Difference between RANK, DENSE_RANK, and ROW_NUMBER?
-- Data: Salaries = 100, 100, 90, 80
-- ROW_NUMBER: 1, 2, 3, 4 (unique numbers, no gaps)
-- RANK: 1, 1, 3, 4 (same value = same rank, gaps after ties)
-- DENSE_RANK: 1, 1, 2, 3 (same value = same rank, no gaps)
SELECT
name, salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) as row_num,
RANK() OVER (ORDER BY salary DESC) as rank,
DENSE_RANK() OVER (ORDER BY salary DESC) as dense_rank
FROM employees;
Q6: Write a query to find employees who earn more than their manager.
SELECT [Link] as employee, [Link] as emp_salary,
[Link] as manager, [Link] as mgr_salary
FROM employees e
JOIN employees m ON e.manager_id = m.employee_id
WHERE [Link] > [Link];
Q7: What's the difference between DELETE, TRUNCATE, and DROP?
• DELETE - Removes specific rows, can use WHERE, can rollback, slower
• TRUNCATE - Removes ALL rows, cannot use WHERE, faster, resets auto-increment
• DROP - Removes entire table structure + data, cannot recover
Q8: Find customers who placed orders in both 2023 AND 2024.
-- Using INTERSECT
SELECT customer_id FROM orders WHERE YEAR(order_date) = 2023
INTERSECT
SELECT customer_id FROM orders WHERE YEAR(order_date) = 2024;
-- Using JOIN
SELECT DISTINCT o1.customer_id
FROM orders o1
JOIN orders o2 ON o1.customer_id = o2.customer_id
WHERE YEAR(o1.order_date) = 2023 AND YEAR(o2.order_date) = 2024;
Q9: Calculate running total of sales.
SELECT
order_date,
amount,
SUM(amount) OVER (ORDER BY order_date) as running_total
FROM sales;
-- Running total by customer
SELECT
customer_id,
order_date,
amount,
SUM(amount) OVER (
PARTITION BY customer_id
ORDER BY order_date
) as customer_running_total
FROM sales;
Q10: Find the top 3 products by sales in each category.
SELECT * FROM (
SELECT
category,
product_name,
total_sales,
DENSE_RANK() OVER (
PARTITION BY category
ORDER BY total_sales DESC
) as rank
FROM products
) ranked
WHERE rank <= 3;
Quick SQL Questions:
Q11: UNION vs UNION ALL?
UNION removes duplicates (slower). UNION ALL keeps all rows (faster).
Q12: What is a Primary Key vs Foreign Key?
Primary Key: Unique identifier for each row, cannot be NULL.
Foreign Key: References primary key of another table, creates relationship.
Q13: What is normalization?
Organizing data to reduce redundancy. 1NF (no repeating groups), 2NF (no partial dependencies), 3NF
(no transitive dependencies).
Q14: What is an index? When to use?
Index speeds up data retrieval (like a book index). Use on columns used in WHERE, JOIN, ORDER BY.
Don't over-index - slows down INSERT/UPDATE.
Q15: NULL vs empty string vs 0?
NULL = unknown/missing value. Empty string = blank text (exists but empty). 0 = actual numeric value.
■ SQL Interview Tip: Always talk through your approach before writing. 'First I'll filter X, then join with Y,
then aggregate by Z.' Shows you can think, not just memorize.
PART 2: Python/Pandas Questions
Q1: How do you handle missing values in pandas?
# Check missing values
[Link]().sum()
# Drop rows with any missing values
[Link]()
# Drop rows where specific columns are missing
[Link](subset=['email', 'phone'])
# Fill missing values
df['salary'].fillna(df['salary'].mean(), inplace=True)
df['category'].fillna('Unknown', inplace=True)
# Forward fill (use previous value)
df['value'].fillna(method='ffill')
Q2: Difference between loc and iloc?
• loc - Label-based selection (uses column names, index labels)
• iloc - Integer-based selection (uses position numbers)
# loc (by label)
[Link][0:5, ['name', 'salary']] # rows 0-5 (inclusive), specific columns
# iloc (by position)
[Link][0:5, 0:3] # rows 0-4 (exclusive end), columns 0-2
Q3: How to merge two DataFrames?
# Inner join (default)
merged = [Link](df1, df2, on='customer_id')
# Left join
merged = [Link](df1, df2, on='customer_id', how='left')
# Join on different column names
merged = [Link](df1, df2, left_on='cust_id', right_on='customer_id')
# Multiple join columns
merged = [Link](df1, df2, on=['customer_id', 'product_id'])
Q4: Explain groupby in pandas.
# Basic groupby
[Link]('category')['sales'].sum()
# Multiple aggregations
[Link]('category')['sales'].agg(['sum', 'mean', 'count'])
# Group by multiple columns
[Link](['region', 'category'])['sales'].sum()
# Different aggregations for different columns
[Link]('category').agg({
'sales': 'sum',
'quantity': 'mean',
'customer_id': 'nunique'
})
Q5: How to create a new column based on conditions?
# Simple condition
df['status'] = [Link](df['score'] >= 50, 'Pass', 'Fail')
# Multiple conditions
df['grade'] = [Link](
[df['score'] >= 90, df['score'] >= 80, df['score'] >= 70],
['A', 'B', 'C'],
default='F'
)
# Using apply with function
def categorize(row):
if row['sales'] > 10000:
return 'High'
elif row['sales'] > 5000:
return 'Medium'
return 'Low'
df['tier'] = [Link](categorize, axis=1)
More Python Questions:
Q6: List vs Tuple vs Set vs Dictionary?
List: Ordered, mutable, allows duplicates [1, 2, 3]
Tuple: Ordered, immutable, allows duplicates (1, 2, 3)
Set: Unordered, mutable, NO duplicates {1, 2, 3}
Dict: Key-value pairs, keys unique {'a': 1, 'b': 2}
Q7: What is a lambda function?
# Lambda = anonymous one-line function
square = lambda x: x**2
# Common use with pandas
df['column'].apply(lambda x: [Link]())
df.sort_values(key=lambda x: [Link]())
Q8: How to remove duplicates in pandas?
df.drop_duplicates() # All columns
df.drop_duplicates(subset=['email']) # Based on email
df.drop_duplicates(keep='last') # Keep last occurrence
Q9: Difference between apply, map, and applymap?
• apply: Works on rows or columns (Series or DataFrame)
• map: Works element-wise on a Series only
• applymap: Works element-wise on entire DataFrame
Q10: How to pivot a DataFrame?
# Pivot table
pivot = df.pivot_table(
values='sales',
index='region',
columns='category',
aggfunc='sum'
)
# Melt (unpivot)
melted = [Link](df, id_vars=['id'], value_vars=['Q1', 'Q2', 'Q3'])
PART 3: Statistics Questions
Q1: Mean vs Median vs Mode - when to use what?
Mean: Average of all values. Use for normally distributed data.
Median: Middle value. Use when data has outliers (salary, house prices).
Mode: Most frequent value. Use for categorical data.
Example: Salaries of 50K, 55K, 60K, 65K, 500K → Mean=146K (misleading), Median=60K (better)
Q2: What is standard deviation?
Measures how spread out data is from the mean. Low SD = data clustered around mean. High SD = data
spread out.
68% of data within 1 SD, 95% within 2 SD, 99.7% within 3 SD (for normal distribution).
Q3: What is correlation? What does correlation of 0.8 mean?
Correlation measures linear relationship between two variables (-1 to +1).
+1 = Perfect positive (both increase together)
-1 = Perfect negative (one increases, other decreases)
0 = No linear relationship
0.8 = Strong positive correlation - when one increases, the other likely increases too.
■■ Correlation ≠ Causation. Ice cream sales and drowning are correlated (both high in summer), but
one doesn't cause the other.
Q4: What is a p-value?
Probability of getting your result if the null hypothesis is true.
p < 0.05: Result is 'statistically significant' - unlikely due to random chance.
p > 0.05: Cannot reject null hypothesis - result might be due to chance.
Q5: Type I vs Type II error?
Type I (False Positive): Rejecting true null hypothesis. Example: Saying drug works when it doesn't.
Type II (False Negative): Failing to reject false null. Example: Saying drug doesn't work when it does.
Q6: What is the Central Limit Theorem?
When you take many samples from ANY distribution, the sample means will be normally distributed.
This is why normal distribution is so important - it shows up everywhere.
Q7: What is A/B testing?
Comparing two versions to see which performs better.
• Split users randomly into control (A) and treatment (B)
• Measure metric (conversion, clicks, revenue)
• Use statistical test to determine if difference is significant
Q8: Explain the difference between population and sample.
Population: Entire group you want to study (all customers).
Sample: Subset you actually measure (1000 surveyed customers).
We use samples because measuring entire population is often impossible.
Q9: What is a confidence interval?
Range where true value likely lies. '95% confidence interval of 45-55' means:
If we repeated this study 100 times, 95 of those intervals would contain the true value.
Q10: Normal distribution vs skewed distribution?
Normal: Symmetric bell curve, mean = median = mode.
Right-skewed: Long tail to right (income, house prices). Mean > Median.
Left-skewed: Long tail to left (age at retirement). Mean < Median.
PART 4: Business/Case Study Questions
Q1: How would you measure the success of a new product feature?
Framework:
1. Define the goal - What problem does this feature solve?
2. Choose metrics - Primary (e.g., conversion rate) + Secondary (e.g., time spent)
3. Set baseline - What's the current performance?
4. Run A/B test - Compare with control group
5. Analyze results - Statistical significance + practical significance
6. Consider tradeoffs - Did improving one metric hurt another?
Q2: Revenue dropped 10% this month. How would you investigate?
1. Verify data quality - Is the data correct? Any logging issues?
2. Segment analysis:
• By region - Is it all regions or specific ones?
• By product - Which products declined?
• By customer type - New vs returning?
• By channel - Web vs mobile vs app?
3. Time analysis - When did it start? Gradual or sudden?
4. External factors - Seasonality? Competitor actions? Economic conditions?
5. Internal changes - Price changes? Marketing spend? Site changes?
Q3: What metrics would you track for an e-commerce business?
Revenue metrics: GMV, Revenue, AOV (Average Order Value)
Conversion funnel: Visitors → Product views → Add to cart → Purchase
Customer metrics: CAC, LTV, Retention rate, Repeat purchase rate
Operational: Cart abandonment rate, Delivery time, Return rate
Q4: How would you explain correlation vs causation to a non-technical person?
'Just because two things happen together doesn't mean one causes the other.'
Example: Ice cream sales and drowning deaths both increase in summer. Ice cream doesn't cause
drowning - hot weather causes both.
To prove causation, you need controlled experiments where you change one thing and keep everything
else the same.
Q5: How do you prioritize what to analyze when you have many possible questions?
1. Business impact - What decisions will this inform?
2. Urgency - Is this blocking something important?
3. Feasibility - Do we have the data? How long will it take?
4. Stakeholder needs - Who asked for this and why?
Focus on high-impact + feasible items first.
PART 5: Behavioral Questions
Q1: Tell me about yourself.
Structure (2 minutes max):
• Present: Current role/what you're doing now
• Past: Relevant experience/education
• Future: Why this role/company
End with why you're excited about THIS opportunity.
Q2: Tell me about a project you're proud of.
Use STAR method:
• Situation: Context and problem
• Task: Your specific responsibility
• Action: What YOU did (be specific)
• Result: Quantified outcome (increased X by Y%)
Q3: How do you handle conflicting priorities?
'I clarify urgency and impact with stakeholders, communicate tradeoffs, and if needed, escalate to get
alignment on priority.'
Q4: Describe a time when your analysis was challenged.
Show you can:
• Listen without being defensive
• Validate concerns by re-checking your work
• Either defend with evidence OR acknowledge the error
• Learn from the experience
Q5: Why data analytics?
Be genuine. Talk about:
• What excites you about working with data
• A specific moment that sparked your interest
• How you see data driving real impact
Final Interview Tips
✓ Think out loud - Interviewers want to see your thought process
✓ Ask clarifying questions - It's expected and shows you're thorough
✓ Admit what you don't know - 'I'm not sure, but I would approach it by...'
✓ Practice SQL on paper - Many interviews don't have a computer
✓ Prepare 2-3 project stories - Know them inside out
✓ Research the company - Know their product, data challenges, recent news
You've got this. Prepare well, stay calm, and show them how you think. Good luck! ■