Microservices Interview Guide

The document is a comprehensive interview guide on Microservices Architecture, covering 87 questions related to system design, communication methods, orchestration, and various patterns used in microservices. It discusses the advantages and disadvantages of microservices, migration strategies from monolithic architectures, and includes practical design scenarios. The guide serves as a resource for understanding microservices concepts, best practices, and industry applications.

Uploaded by

Surya Prasann

Microservices Architecture

Complete Interview Guide

87 Questions · System Design · Diagrams · Code Examples · Industry Scenarios


Table of Contents
Q1. What is Microservices Architecture?
Q2. Monolithic vs Microservices architecture?
Q3. Why migrate to Microservices? Migration considerations?
Q4. Advantages and disadvantages of Microservices?
Q5. When is Microservices NOT recommended?
Q6. Monolithic vs SOA vs Microservices?
Q7. REST vs Kafka: how microservices communicate?
Q8. Synchronous vs asynchronous communication?
Q9. How do you manage microservices orchestration?
Q10. What is Service Discovery? Eureka Server?
Q11. Circuit Breaker pattern: Resilience4j?
Q12. Retry Pattern: design a retry mechanism?
Q13. Saga Pattern: Choreography vs Orchestration. Saga vs 2PC?
Q14. Distributed Transactions: why not @Transactional across services?
Q15. What is Eventual Consistency?
Q16. What is idempotency? Idempotency keys?
Q17. What is a Dead Letter Queue (DLQ)?
Q18. What is a transactionId and how is it used?
Q19. What is reconciliation in distributed systems?
Q20. How do you ensure data consistency across microservices?
Q21. Payment success but booking fails: how to handle?
Q22. Booking service triggered after payment: what if it's down?
Q23. How do you handle refunds in a distributed system?
Q24. How do you prevent duplicate ticket booking?
Q25. How do you ensure system reliability in distributed failures?
Q26. Event-driven architecture: Kafka's role?
Q27. How do services communicate after payment completion?
Q28. What is Feign Client and how does it work?
Q29. What happens if a dependent service is down?
Q30. Cascading failure: how to prevent it?
Q31. What is a load balancer? How does it work?
Q32. How do you implement horizontal scaling?
Q33. Vertical vs horizontal scaling?
Q34. How do you scale the database?
Q35. How do you manage auto-scaling during traffic spikes?
Q36. How do you design a cloud-native application?
Q37. What is Cloud Native?
Q38. How do you ensure fault tolerance in microservices?
Q39. How do you ensure scalability in microservices?
Q40. How do you handle high traffic (100 to 1000 users)?
Q41. How do you monitor and debug production issues?
Q42. How do you debug a microservice when it becomes slow?
Q43. How do you identify bottlenecks in microservices?
Q44. What is the on-call / incident management process?
Q45. How is work divided across teams? Code reviews and releases?
Q46. Tools for logging, monitoring, tracing (ELK, Prometheus, Grafana, etc.)?
Q47. Spring Boot Actuator in microservices for monitoring?
Q48. What is Micrometer?
Q49. How do you handle failures: circuit breakers, retries, fallback?
Q50. How do you design a system for read-heavy traffic?
Q51. What is the Hot Key Problem in Redis?
Q52. What is Sharding and how does it help?
Q53. Client-side vs server-side service discovery?
Q54. Design a scalable backend for an IoT-enabled ambulance tracking system.
Q55. Design a ticket management system using microservices.
Q56. Design a notification system using the Open/Closed Principle.
Q57. Design a price drop notification system for e-commerce.
Q58. Design a real-time SMS notification system.
Q59. Design a scalable file upload system for large files (2GB+).
Q60. Explain Group Anagrams problem and its time complexity.
Q61. What is CQRS? How does data flow in CQRS?
Q62. Other approaches besides Redis and CDN for high traffic?
Q63. How to implement a custom Spring Security filter?
Q64. What is P2P communication between microservices? Problems?
Q65. What is traffic routing in microservices?
Q66. How do you handle inter-service communication failure?
Q67. What is WebSocket? How does it differ from REST?
Q68. Design a 1-to-1 video call architecture. Group video call?
Q69. Signaling server? STUN/TURN server? WebRTC?
Q70. What is HLD vs LLD?
Q71. Explain your E-commerce system architecture (HLD).
Q72. How do you design the Inventory Service?
Q73. How to design a Cart Service?
Q74. How to design a Notification Service with multi-channel support?
Q75. What design patterns are used in a Notification Service?
Q76. What is FreeMarker? How do you design email templates?
Q77. How to confirm that 1000 notifications were sent successfully?
Q78. How to design a payment system?
Q79. How to handle a scheduler vs webhook race condition?
Q80. Components in a Kafka-based microservices architecture?
Q81. How do you ensure zero downtime deployments?
Q82. Canary Deployment vs Blue-Green Deployment?
Q83. What is Observability in microservices?
Q84. What is the ELK Stack? Centralised logging?
Q85. How do logs get added to Splunk?
Q86. How does distributed logging work using Correlation IDs?
Q87. What is the difference between Grafana and Prometheus?

Q1. What is Microservices Architecture? Give a 4-5 line definition.

Microservices Architecture is a software design approach where a large application is broken into a
collection of small, independently deployable services. Each service is responsible for a specific business
capability (e.g., Order Service, Payment Service, User Service), runs in its own process, and
communicates over lightweight protocols like HTTP/REST or messaging queues. Each service has its own
database, can be deployed independently, and can be written in any technology stack. Together they form
the complete application, but each can be scaled, updated, or replaced without affecting others.

Characteristic | Description
Single Responsibility | Each service does one business function well
Independent Deployment | Deploy one service without redeploying others
Decentralized Data | Each service owns its database
Fault Isolation | Failure in one service doesn't crash the whole app
Technology Freedom | Each service can use its own language/framework

Industry Use: Netflix decomposed their monolith into 700+ microservices to handle 200M+ streaming
users. Each team owns their service independently.

Q2. What is the difference between Monolithic and Microservices architecture?

Monolithic | Microservices
All features in one codebase | Features split into services
Single deployable unit | Each service deployed independently
Shared database | Each service has its DB
Scale entire app together | Scale specific services only
One tech stack | Each service chooses its stack
Simple to develop initially | Complex to develop/operate
Hard to maintain at scale | Easier to maintain at scale

Monolithic vs Microservices — Side-by-Side

// Monolithic: all logic in one Spring Boot app
@RestController
public class OrderController {
    @Autowired UserService userService;      // same JVM
    @Autowired InventoryService inventory;   // same JVM
    @Autowired PaymentService payment;       // same JVM
}

// Microservices: services talk over HTTP/Kafka
@RestController
public class OrderController {
    // UserClient is a hypothetical interface annotated with @FeignClient;
    // @FeignClient itself is an annotation and cannot be injected as a type
    @Autowired UserClient userClient;               // HTTP call to User Service
    @Autowired KafkaTemplate<String, Object> kafka; // async to Inventory Service
}

Q3. Why migrate from Monolithic to Microservices? Migration considerations?

Reasons to migrate:
• Deployment bottleneck: small change requires full redeploy
• Scaling: can't scale just the checkout service during flash sales
• Team size: large teams step on each other's code
• Tech debt: old stack limits adoption of new frameworks
• Fault isolation: one bug takes down the whole app

Migration strategy — Strangler Fig Pattern:

Step 1: Identify bounded contexts (Order, Payment, User, Inventory)
Step 2: Extract the least-coupled service first (e.g., Notification)
Step 3: Add an API Gateway to route to old monolith OR new service
Step 4: Gradually move traffic from monolith to new service
Step 5: Decommission the corresponding monolith module
Step 6: Repeat until monolith is fully replaced
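
Steps 3 and 4 can be sketched as a small routing table. This is a minimal illustration in plain Java, not a real gateway: the class name StranglerRouter, the path prefixes, and the "new-service" / "monolith" labels are all assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical gateway rule: send a percentage of traffic for a path prefix to the new service. */
final class StranglerRouter {
    // Percentage (0-100) of traffic routed to the extracted service, per path prefix.
    private final Map<String, Integer> rolloutPercent = new ConcurrentHashMap<>();

    void setRollout(String pathPrefix, int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be 0-100");
        }
        rolloutPercent.put(pathPrefix, percent);
    }

    /** Returns "new-service" or "monolith" for the given request path and user id. */
    String route(String path, String userId) {
        for (Map.Entry<String, Integer> e : rolloutPercent.entrySet()) {
            if (path.startsWith(e.getKey())) {
                // Stable bucket per user, so one user does not flip between backends.
                int bucket = Math.floorMod(userId.hashCode(), 100);
                return bucket < e.getValue() ? "new-service" : "monolith";
            }
        }
        return "monolith"; // paths that are not extracted yet stay on the monolith
    }
}
```

Raising the percentage from 0 to 100 over several releases corresponds to Step 4; removing the monolith module once it reaches 100 corresponds to Step 5.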

Key migration considerations:


• Database decomposition: split the shared DB (use CDC, Change Data Capture)
• Define clear service boundaries via Domain-Driven Design (DDD)
• Handle distributed transactions (Saga pattern)
• Set up observability: centralized logging, distributed tracing
• Run both monolith and services in parallel during transition

Industry Use: Amazon migrated in the early 2000s using an incremental, Strangler Fig style approach,
slowly carving out services around the edges of their monolith. Today they run thousands of microservices.

Q4. What are the advantages and disadvantages of Microservices?

Advantages | Disadvantages
Independent deployability | Distributed system complexity
Technology heterogeneity | Network latency between services
Fault isolation | Data consistency challenges
Granular scalability | More infrastructure overhead
Small, focused teams | Harder to debug across services
Easier to understand each service | Operational complexity (K8s, service mesh)
Faster CI/CD per service | Requires mature DevOps culture

Microservices Pros and Cons

Q5. When is Microservices architecture NOT recommended?

Microservices add significant operational complexity. They are NOT recommended when:

• Small team (1-5 devs): overhead of managing 10+ services outweighs benefits
• Early-stage startup: business domain not yet stable — boundaries will change
• Simple CRUD app: a blog, internal tool, or admin panel doesn't need microservices
• No DevOps maturity: without CI/CD, container orchestration, and monitoring in place
• Tight latency requirements: inter-service HTTP calls add ms latency vs in-process calls

Start with a well-structured monolith. Extract services only when you have a real scaling or
team independence problem. 'Microservices-first' is an anti-pattern for most startups.

Q6. What is the difference between Monolithic, SOA, and Microservices?

Aspect | Monolithic | SOA | Microservices
Size | One large app | Large services | Small focused services
Communication | In-process | ESB (Enterprise Service Bus) | REST/gRPC/Kafka
Data | Shared DB | May share DB | Each service owns DB
Deployment | Single unit | Multiple services | Each service independent
Coupling | High | Medium (ESB bottleneck) | Loose
Use case | Small apps | Enterprise integration | Modern cloud apps

Q7. How do microservices communicate with each other (REST vs Kafka decisions)?

REST / gRPC (Synchronous) | Kafka / RabbitMQ (Asynchronous)
Caller waits for response | Caller doesn't wait (fire and forget)
Use when you need immediate answer | Use for eventual consistency workflows
Example: check stock before checkout | Example: send email after order placed
Simpler to implement | Decoupled: receiver can be down
Tight coupling: both must be up | Better fault tolerance
Tools: Feign Client, RestTemplate, gRPC | Tools: Spring Kafka, RabbitMQ, AWS SQS

When to choose REST vs Kafka

Decision rule:
• Sync (REST/gRPC): User expects immediate response (login, payment status check, inventory check)
• Async (Kafka): Background work that doesn't block user (send email, update analytics, sync inventory)

Industry Use: In e-commerce: Payment → Order is REST (must confirm before proceeding). Order →
Notification is Kafka (email can arrive seconds later).

Q8. What is synchronous vs asynchronous communication in microservices?

Synchronous Communication:

Client → Service A → (HTTP call) → Service B → returns response

Service A WAITS for Service B response

If B is slow → A is slow → Client is slow (cascading latency)

If B is down → A fails → Client gets error

Asynchronous Communication:

Client → Service A → publishes event to Kafka → returns 202 Accepted

Service A does NOT wait

Service B (consumer) picks up event when ready

If B is down → event stays in Kafka → processed when B recovers

// Synchronous: REST via Feign
@FeignClient(name = "inventory-service")
public interface InventoryClient {
    @GetMapping("/stock/{productId}")
    StockResponse checkStock(@PathVariable("productId") Long productId); // Feign needs the explicit name
}

// Asynchronous: Kafka
@Service
public class OrderService {
    @Autowired KafkaTemplate<String, OrderEvent> kafka;

    public void placeOrder(Order o) {
        [Link](o);
        kafka.send("order-events", new OrderEvent([Link](), "PLACED"));
        // returns immediately, no waiting for downstream
    }
}

Q9. How do you manage microservices orchestration?

Orchestration means one central service (orchestrator) coordinates a multi-step workflow by calling other
services in sequence. Contrast with Choreography where services react to events independently.

Orchestration flow (Order placement):

Order Orchestrator starts
1. Call Inventory Service → reserve stock
2. Call Payment Service → charge customer
3. Call Shipping Service → create shipment
4. Call Notification Service → send confirmation email
If any step fails → call compensating transactions (rollback)

// Orchestrator using Saga with Spring State Machine or Temporal
@Service
public class OrderOrchestrator {
    public OrderResult placeOrder(OrderRequest req) {
        try {
            [Link]([Link](), [Link]());
            [Link]([Link](), [Link]());
            [Link](req);
            [Link]([Link]());
            return [Link];
        } catch (PaymentFailedException e) {
            [Link]([Link](), [Link]()); // compensate
            return OrderResult.PAYMENT_FAILED;
        }
    }
}

Industry Use: Tools like AWS Step Functions and Temporal are used in industry for durable
orchestration — they persist workflow state so it survives restarts.

Q10. What is Service Discovery? What is Eureka Server and why is it needed?

In microservices, service instances come and go (due to scaling, restarts, failures). Their IPs and ports
change dynamically. Service Discovery solves the problem of 'how does Service A find the current
address of Service B?'

How Eureka works:

1. Each service registers itself with Eureka on startup (sends host:port, service name)
2. Each service sends heartbeat to Eureka every 30s (keeps registration alive)
3. When Service A needs to call Service B:
   Service A asks Eureka: 'Where is order-service?'
4. Eureka returns list of healthy instances
5. Service A picks one (Ribbon/Spring Cloud LoadBalancer does round-robin)
6. Service A makes HTTP call to that instance
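
The register, heartbeat, lookup, and round-robin steps above can be sketched in plain Java. This is an in-memory illustration only, not how Eureka is implemented: InMemoryRegistry and its method names are hypothetical, and time is passed in as a parameter so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical in-memory registry: leases expire when heartbeats stop. */
final class InMemoryRegistry {
    private final long leaseMillis;
    // service name -> (instance address -> time of last heartbeat in ms)
    private final Map<String, Map<String, Long>> leases = new ConcurrentHashMap<>();
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    InMemoryRegistry(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /** Registration and heartbeat are the same operation here: refresh the lease. */
    void heartbeat(String service, String address, long nowMillis) {
        leases.computeIfAbsent(service, s -> new ConcurrentHashMap<>()).put(address, nowMillis);
    }

    /** Instances whose lease has not expired, in sorted order. */
    List<String> healthyInstances(String service, long nowMillis) {
        List<String> out = new ArrayList<>();
        Map<String, Long> byAddress = leases.getOrDefault(service, Map.of());
        byAddress.forEach((address, lastBeat) -> {
            if (nowMillis - lastBeat <= leaseMillis) out.add(address);
        });
        out.sort(String::compareTo);
        return out;
    }

    /** Client-side round-robin over the healthy instances. */
    String pick(String service, long nowMillis) {
        List<String> healthy = healthyInstances(service, nowMillis);
        if (healthy.isEmpty()) {
            throw new IllegalStateException("no healthy instance of " + service);
        }
        int n = counters.computeIfAbsent(service, s -> new AtomicInteger()).getAndIncrement();
        return healthy.get(Math.floorMod(n, healthy.size()));
    }
}
```

An instance that stops sending heartbeats simply drops out of healthyInstances once its lease is older than leaseMillis, so callers stop being routed to it without any explicit deregistration.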

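// Eureka Server
@SpringBootApplication
@EnableEurekaServer
public class DiscoveryServer { ... }

# application.yml (Eureka server)
server.port: 8761
eureka.client.register-with-eureka: false
eureka.client.fetch-registry: false

// Each microservice (Eureka client)
@SpringBootApplication
@EnableEurekaClient // optional: having the Eureka client starter on the classpath enables registration
public class OrderService { ... }

# application.yml (order service)
spring.application.name: order-service
eureka.client.serviceUrl.defaultZone: [Link]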

Client-side vs Server-side discovery:


• Client-side (Eureka + Ribbon): Client fetches registry and does load balancing itself
• Server-side (AWS ALB, Kubernetes): A load balancer routes traffic; client just calls a fixed DNS name

Industry Use: Kubernetes replaces Eureka for most cloud-native apps. Kubernetes DNS provides
built-in service discovery via Service resources.

Q11. What is Circuit Breaker pattern? How does Resilience4j implement it?

The Circuit Breaker pattern prevents cascading failures. When Service B is failing/slow, instead of Service
A repeatedly calling and waiting (wasting threads), the circuit 'opens' and calls fail fast with a fallback
response.

Three states:

State | Behavior | Transition
CLOSED (normal) | All calls pass through | If failures > threshold → OPEN
OPEN (broken) | All calls fail immediately (no wait) | After wait duration → HALF-OPEN
HALF-OPEN (testing) | Allow few test calls through | Success → CLOSED; Fail → OPEN

CLOSED → (50% failures in 10 calls) → OPEN

OPEN → (30 second wait) → HALF-OPEN

HALF-OPEN → (3 test calls succeed) → CLOSED

HALF-OPEN → (test call fails) → OPEN
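
The state transitions above can be sketched as a small state machine. This is a simplified, count-based illustration in plain Java and not Resilience4j's implementation: SimpleCircuitBreaker and its parameters are hypothetical names, and the current time is passed in so the transitions are deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

/** Hypothetical minimal circuit breaker with a count-based sliding window. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;         // calls in the sliding window, e.g. 10
    private final int failureRatePercent; // open at or above this rate, e.g. 50
    private final long openMillis;        // time to stay OPEN, e.g. 30_000
    private final int halfOpenCalls;      // successful trial calls needed to close, e.g. 3
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failed call
    private State state = State.CLOSED;
    private long openedAt;
    private int trialSuccesses;

    SimpleCircuitBreaker(int windowSize, int failureRatePercent, long openMillis, int halfOpenCalls) {
        this.windowSize = windowSize;
        this.failureRatePercent = failureRatePercent;
        this.openMillis = openMillis;
        this.halfOpenCalls = halfOpenCalls;
    }

    synchronized State state() {
        return state;
    }

    synchronized <T> T call(Supplier<T> action, Supplier<T> fallback, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt < openMillis) return fallback.get(); // fail fast
            state = State.HALF_OPEN; // wait is over: let trial calls through
            trialSuccesses = 0;
        }
        try {
            T result = action.get();
            record(false, nowMillis);
            return result;
        } catch (RuntimeException e) {
            record(true, nowMillis);
            return fallback.get();
        }
    }

    private void record(boolean failed, long nowMillis) {
        if (state == State.HALF_OPEN) {
            if (failed) {
                open(nowMillis); // a failed trial call reopens the circuit
            } else if (++trialSuccesses >= halfOpenCalls) {
                state = State.CLOSED;
                window.clear();
            }
            return;
        }
        window.addLast(failed);
        if (window.size() > windowSize) window.removeFirst();
        long failures = window.stream().filter(f -> f).count();
        if (window.size() == windowSize && failures * 100 >= (long) failureRatePercent * windowSize) {
            open(nowMillis);
        }
    }

    private void open(long nowMillis) {
        state = State.OPEN;
        openedAt = nowMillis;
        window.clear();
    }
}
```

While the breaker is OPEN the protected action is never invoked, which is what frees the caller's threads when the downstream service is slow.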

// [Link]
// spring-cloud-starter-circuitbreaker-resilience4j

@Service
public class OrderService {

    @CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
    @Retry(name = "paymentService")
    @TimeLimiter(name = "paymentService")
    public CompletableFuture<String> processPayment(PaymentRequest req) {
        return CompletableFuture.supplyAsync(() -> [Link](req));
    }

    // Fallback: same parameters as the protected method, plus the exception
    public CompletableFuture<String> paymentFallback(PaymentRequest req, Exception ex) {
        // Return cached response or queue for retry
        return CompletableFuture.completedFuture("PAYMENT_QUEUED");
    }
}

# application.yml
resilience4j.circuitbreaker.instances.paymentService:
  slidingWindowSize: 10
  failureRateThreshold: 50 # open if 50% fail
  waitDurationInOpenState: 30s
  permittedNumberOfCallsInHalfOpenState: 3

Industry Use: Netflix Hystrix (now in maintenance mode) has largely been replaced by Resilience4j in
Spring applications. At Zepto, circuit breakers wrap calls to payment gateway so checkout doesn't hang
when Razorpay is slow.

Q12. What is the Retry Pattern? How do you design a retry mechanism between microservices?

The Retry pattern automatically retries failed calls to a service, handling transient failures (brief network
hiccups, temporary overload). It must be used carefully to avoid making problems worse.

Key retry design decisions:


• Exponential backoff: Wait 1s, 2s, 4s, 8s between retries — don't hammer a struggling service
• Jitter: Add random delay to prevent thundering herd (all instances retrying at same time)
• Max attempts: Set a limit (3-5 retries) — don't retry forever
• Idempotency: Only retry idempotent operations — don't retry a payment twice!
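
The backoff, jitter, and max-attempts decisions above can be sketched in plain Java. This is a minimal illustration under stated assumptions: RetryWithBackoff is a hypothetical helper, the 30-second cap is an arbitrary choice, and the "full jitter" variant (random delay between 0 and the computed backoff) is one of several jitter strategies.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

/** Hypothetical retry helper: exponential backoff, full jitter, bounded attempts. */
final class RetryWithBackoff {

    /** Delay before retry number `retry` (1-based): base * multiplier^(retry-1), capped. */
    static long backoffMillis(int retry, long baseMillis, double multiplier, long capMillis) {
        double raw = baseMillis * Math.pow(multiplier, retry - 1);
        return (long) Math.min(raw, capMillis);
    }

    /** Full jitter: a random delay in [0, backoff] so instances do not retry in lockstep. */
    static long withJitter(long backoffMillis) {
        return ThreadLocalRandom.current().nextLong(backoffMillis + 1);
    }

    /** Runs the call up to maxAttempts times. Only pass idempotent operations. */
    static <T> T retry(Supplier<T> idempotentCall, int maxAttempts, long baseMillis) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return idempotentCall.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt < maxAttempts) {
                    sleep(withJitter(backoffMillis(attempt, baseMillis, 2.0, 30_000)));
                }
            }
        }
        throw last; // all attempts failed: surface the last error
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during backoff", ie);
        }
    }
}
```

With a 1000 ms base and multiplier 2, the un-jittered delays are 1000, 2000, 4000, 8000 ms; the cap keeps later retries from growing without bound.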

# application.yml (Resilience4j Retry)
resilience4j.retry.instances.inventoryService:
  maxAttempts: 3                  # total attempts, including the first call
  waitDuration: 1s
  enableExponentialBackoff: true
  exponentialBackoffMultiplier: 2 # waits 1s, then 2s
  retryExceptions:
    - [Link]
    - [Link]
  ignoreExceptions:
    - [Link] # don't retry business errors

@Service
public class InventoryService {
    @Retry(name = "inventoryService", fallbackMethod = "fallback")
    public StockResponse checkStock(Long productId) {
        return [Link](productId);
    }

    public StockResponse fallback(Long productId, Exception e) {
        return [Link](productId); // serve from cache
    }
}

Q13. What is the Saga Pattern? Choreography vs Orchestration. Saga vs 2PC?

The Saga Pattern is a way to manage distributed transactions across multiple microservices. Instead of
one ACID transaction, a Saga is a sequence of local transactions, each publishing an event that triggers
the next. If a step fails, compensating transactions undo previous steps.

Choreography Saga (event-driven, decentralised):

Order Service → saves order (PENDING) → publishes OrderPlaced event
Inventory Service consumes → reserves stock → publishes StockReserved
Payment Service consumes → charges card → publishes PaymentSuccess
Shipping Service consumes → creates shipment → publishes ShipmentCreated
If Payment fails: PaymentFailed event → Inventory releases stock (compensation)

Orchestration Saga (centralised coordinator):

Order Orchestrator (Saga coordinator) directs each step
1. Command: ReserveInventory → Inventory Service
2. Command: ProcessPayment → Payment Service
3. Command: CreateShipment → Shipping Service
On failure: Orchestrator sends compensating commands in reverse order
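
The "compensate in reverse order" rule can be sketched as follows. This is a minimal in-process illustration, assuming each step is a local call: SagaOrchestrator and Step are hypothetical names, and a production orchestrator would also persist its progress so it can resume after a crash.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical saga runner: forward steps in order, compensations in reverse. */
final class SagaOrchestrator {

    /** One saga step: a forward action plus the compensation that undoes it. */
    private static final class Step {
        final String name;
        final Runnable action;
        final Runnable compensation;

        Step(String name, Runnable action, Runnable compensation) {
            this.name = name;
            this.action = action;
            this.compensation = compensation;
        }
    }

    private final List<Step> steps = new ArrayList<>();

    SagaOrchestrator step(String name, Runnable action, Runnable compensation) {
        steps.add(new Step(name, action, compensation));
        return this;
    }

    /** Returns true if every step succeeded; otherwise undoes completed steps and returns false. */
    boolean run() {
        Deque<Step> completed = new ArrayDeque<>(); // used as a stack: last completed on top
        for (Step s : steps) {
            try {
                s.action.run();
                completed.push(s);
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run(); // most recent step is undone first
                }
                return false;
            }
        }
        return true;
    }
}
```

Only steps that actually completed are compensated: if payment fails, inventory is released, but no refund or shipment cancellation is issued.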

Aspect | Saga Pattern | 2PC (Two-Phase Commit)
Consistency | Eventual consistency | Strong ACID consistency
Blocking | Non-blocking | Blocks all participants until commit
Availability | High (services stay responsive) | Low (locking causes bottleneck)
Failure handling | Compensating transactions | Coordinator rollback (fragile)
Microservices fit | Excellent | Poor (tight coupling)

Industry Use: Saga is the industry standard for distributed transactions. Uber uses choreography for trip
sagas; Temporal/AWS Step Functions enable durable orchestration sagas.

Q14. What are Distributed Transactions? Why can't we use @Transactional across services?

@Transactional works by binding a transaction to a database connection inside a single JVM. The
transaction manager opens a DB connection, runs SQL, then commits or rolls back. With the default setup
this covers ONE database in ONE process.

In microservices, Order Service uses its own MySQL DB, Payment Service uses its own PostgreSQL DB.
There is no shared connection or shared transaction manager — @Transactional literally cannot span two
different databases on two different machines.

// This CANNOT work across microservices:
@Transactional // only covers the local DB
public void placeOrder(Order o) {
    [Link](o); // local MySQL: OK
    [Link]([Link]()); // REMOTE HTTP call
    // If the local transaction rolls back after the remote call succeeded,
    // the remote charge is NOT undone: @Transactional only covers the local DB
}

// Solution: Outbox Pattern + Saga
@Transactional // local transaction only
public void placeOrder(Order o) {
    [Link](o);
    [Link](new OutboxEvent("ORDER_PLACED", [Link]()));
    // The order row and the outbox row commit together; a separate relay
    // later publishes the outbox row to Kafka (at-least-once delivery)
}

Industry Use: The Outbox Pattern avoids a dual write to the DB and Kafka by storing the business row
and an outbox event row in the same local transaction; a relay then publishes the outbox row to Kafka.
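
The relay side of the Outbox Pattern can be sketched in memory. This is an illustration only: OutboxSketch, OutboxRow, and the two lists standing in for database tables are assumptions, and a real relay would poll the outbox table or use CDC.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical in-memory outbox: lists stand in for the orders and outbox tables. */
final class OutboxSketch {

    static final class OutboxRow {
        final String type;
        final String aggregateId;
        boolean published;

        OutboxRow(String type, String aggregateId) {
            this.type = type;
            this.aggregateId = aggregateId;
        }
    }

    private final List<String> orders = new ArrayList<>();
    private final List<OutboxRow> outbox = new ArrayList<>();

    /** Stands in for one local DB transaction: the order and its event are stored together. */
    synchronized void placeOrder(String orderId) {
        if (orderId == null || orderId.isEmpty()) {
            throw new IllegalArgumentException("orderId is required"); // nothing written on failure
        }
        orders.add(orderId);
        outbox.add(new OutboxRow("ORDER_PLACED", orderId));
    }

    /** Relay pass: publish unpublished rows, marking each only after the broker call returns. */
    synchronized int relay(Consumer<OutboxRow> publisher) {
        int sent = 0;
        for (OutboxRow row : outbox) {
            if (row.published) continue;
            publisher.accept(row); // may throw: the row stays unpublished and is retried next pass
            row.published = true;
            sent++;
        }
        return sent;
    }
}
```

Because a row is marked only after a successful publish, a crash between the two steps causes a re-send rather than a lost event, which is why consumers of outbox events should be idempotent.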

Q15. What is Eventual Consistency?

Eventual Consistency means that all replicas/services will eventually converge to the same data state —
but there may be a short period where they are out of sync. Unlike strong consistency (where every read
sees the latest write immediately), eventual consistency trades immediate accuracy for higher availability
and performance.

User places order → Order Service saves PENDING, publishes OrderPlaced event
Inventory Service processes event (few milliseconds later) → updates stock
Between these two moments: Inventory DB shows old stock count
Eventually: Inventory DB reflects the order → consistent state reached

CAP Theorem context:


• Of Consistency, Availability, and Partition Tolerance, a distributed system can guarantee only 2:
  during a network partition it must give up either consistency or availability
• Microservices often choose AP (Available + Partition Tolerant) → accept eventual consistency
• Banking systems often choose CP (Consistent + Partition Tolerant) → strong consistency over availability

Industry Use: Amazon DynamoDB, Cassandra, and Kafka-based microservices all embrace eventual
consistency to achieve 99.99%+ availability at global scale.

Q16. What is idempotency? How do you handle duplicate requests using idempotency keys?

An operation is idempotent if calling it multiple times produces the same result as calling it once. This is
critical in distributed systems because network failures cause retries, so you need to ensure duplicate
requests don't cause duplicate side effects (e.g., double charges).

Implementation with Idempotency Key:

Client generates a unique idempotency key (UUID) per request
Client sends: POST /payments {amount: 500, idempotencyKey: 'abc-123'}
Server checks: is 'abc-123' already in the idempotency_keys table?
  YES → return stored response (no duplicate processing)
  NO → process payment → store {key: 'abc-123', response: ..., ttl: 24h}
Return response to client
If network fails and client retries → 'abc-123' found → same response returned

@PostMapping("/payments")
public ResponseEntity<PaymentResponse> pay(
        @RequestHeader("Idempotency-Key") String iKey,
        @RequestBody PaymentRequest req) {

    Optional<IdempotencyRecord> existing = [Link](iKey);
    if (existing.isPresent()) {
        // Duplicate request: return stored response
        return ResponseEntity.ok(existing.get().getStoredResponse());
    }

    PaymentResponse result = [Link](req);
    [Link](new IdempotencyRecord(iKey, result, [Link]()));
    return ResponseEntity.ok(result);
}

Industry Use: Stripe, Razorpay, and other major payment APIs accept idempotency keys. Stripe keeps
keys for at least 24 hours, so a retry within that window gets the same response.
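
A check-then-save sequence leaves a small window in which two concurrent retries with the same key can both miss the lookup and both run the payment. The sketch below closes that window with an atomic per-key operation. It is an in-memory illustration with a hypothetical IdempotentExecutor class; with a database, a unique constraint on the key column plays the same role.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical executor: runs an operation at most once per idempotency key. */
final class IdempotentExecutor<R> {
    private final Map<String, R> responsesByKey = new ConcurrentHashMap<>();

    /**
     * First call for a key runs the operation and stores its response;
     * later calls with the same key return the stored response.
     */
    R execute(String idempotencyKey, Supplier<R> operation) {
        // computeIfAbsent is atomic per key, so two concurrent retries cannot both run the operation.
        // A null result is not stored, so the operation should return a non-null response.
        return responsesByKey.computeIfAbsent(idempotencyKey, k -> operation.get());
    }
}
```

This sketch never expires keys; adding a TTL (24h in the steps above) would need a timestamp per entry and a cleanup job.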

Q17. What is a Dead Letter Queue (DLQ)? When do you move a message to DLQ?

A Dead Letter Queue (DLQ) is a special queue/topic where messages are sent when they cannot be
processed successfully after a configured number of retries. It prevents bad messages from blocking the
main queue indefinitely.

Consumer receives message from Kafka/SQS topic
Attempt 1: processing fails (e.g., DB down, validation error)
Attempt 2 (after backoff): fails again
Attempt 3: fails again
Max retries reached → message moved to DLQ topic
Main queue continues processing other messages (not blocked)
Ops team monitors DLQ → investigates and replays or discards messages

# application.yml (Kafka DLQ config)
spring.kafka.consumer.group-id: order-consumer-group

@Component
public class OrderEventConsumer {

    @KafkaListener(topics = "order-events")
    @RetryableTopic(
        attempts = "3",
        backoff = @Backoff(delay = 1000, multiplier = 2),
        dltTopicSuffix = "-dlt" // moves to 'order-events-dlt' after 3 failed attempts
    )
    public void consume(OrderEvent event) {
        [Link](event);
    }

    @DltHandler
    public void handleDlt(OrderEvent event,
                          @Header(KafkaHeaders.RECEIVED_TOPIC) String topic) {
        [Link]("DLQ message from {}: {}", topic, event);
        [Link](event); // notify ops team
    }
}

Industry Use: In production Kafka setups, every consumer should have a DLQ. DLQs are monitored via
Grafana alerts — if DLQ message count spikes, it means a downstream service is broken.
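
The retry-then-park behaviour can also be sketched without a broker. This is a plain Java illustration, not Spring Kafka's mechanism: DlqConsumer is a hypothetical class, and the backoff between attempts is left out to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

/** Hypothetical consumer wrapper: bounded attempts, then the message is parked in a DLQ. */
final class DlqConsumer<M> {
    private final int maxAttempts;
    private final Queue<M> deadLetters = new ArrayDeque<>();

    DlqConsumer(int maxAttempts) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        this.maxAttempts = maxAttempts;
    }

    /** Returns true if the handler succeeded; false if the message went to the DLQ. */
    boolean consume(M message, Consumer<M> handler) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);
                return true;
            } catch (RuntimeException e) {
                // A real consumer would log the error and back off before the next attempt.
            }
        }
        deadLetters.add(message); // give up: park the message so later messages are not blocked
        return false;
    }

    Queue<M> deadLetters() {
        return deadLetters;
    }
}
```

A poison message ends up in deadLetters() after the configured attempts, and the next message is processed normally.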

Q18. What is a transactionId and how is it used?

A transactionId (also called correlationId or traceId) is a globally unique identifier assigned to a business
transaction that spans multiple microservices. It threads through all service calls, log entries, and DB
records, allowing you to reconstruct the full journey of one request.

// API Gateway assigns transactionId on first request
@Component
public class TransactionIdFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String txId = ((HttpServletRequest) req).getHeader("X-Transaction-Id");
        if (txId == null) txId = UUID.randomUUID().toString();

        MDC.put("transactionId", txId); // add to all log lines
        ((HttpServletResponse) res).setHeader("X-Transaction-Id", txId);
        try {
            chain.doFilter(req, res);
        } finally {
            MDC.clear(); // always clean up so the id does not leak to the next request on this thread
        }
    }
}

// Log pattern includes transactionId:
// 2024-01-15 [Link] [txId=abc-123] ORDER placed for user 42
// 2024-01-15 [Link] [txId=abc-123] INVENTORY reserved product 99
// 2024-01-15 [Link] [txId=abc-123] PAYMENT charged Rs 500

// Query logs for full trace:
grep 'txId=abc-123' [Link]

Industry Use: In ELK/Splunk, filtering by transactionId gives the complete end-to-end trace of one user
request across all microservices. Essential for production debugging.

Q19. What is reconciliation in distributed systems?

Reconciliation is the process of comparing data between two or more systems to detect and fix
inconsistencies. In distributed systems, due to network failures and eventual consistency, data can get out
of sync. Reconciliation jobs periodically compare source and target systems and correct mismatches.

Example — Payment reconciliation:

Every night at 2 AM: Reconciliation job runs

■ Fetch all 'PENDING' orders from Order DB that are older than a cutoff (the sample code below uses 1 hour)

■ Call Payment Gateway API: get their transaction status for same IDs

■ Compare: Order DB says PENDING, Payment Gateway says SUCCESS

■ Fix: Update Order DB to PAID, trigger fulfillment

■ Also check: Payment Gateway says FAILED, Order DB says PENDING

■ Fix: Update to FAILED, trigger refund if needed

@Scheduled(cron="0 0 2 * * *") // Run at 2 AM daily
public void reconcilePayments() {
List<Order> stuckOrders = orderRepo
.findByStatusAndCreatedAtBefore("PENDING",
[Link]().minusHours(1));

for (Order order : stuckOrders) {


PaymentStatus gwStatus =
[Link]([Link]());

if (gwStatus == SUCCESS && [Link]().equals("PENDING")) {


[Link]("PAID");
[Link](order);
[Link](order);
}
}
}

Industry Use: Payment companies typically run scheduled reconciliation (often nightly) against gateway or
bank settlement reports. High-volume systems such as NPCI (UPI) and IRCTC reconcile payment vs booking
status more frequently.

Q20. How do you ensure data consistency across microservices?

Data consistency in microservices requires careful design. Key patterns:

• Outbox Pattern: Write the business row and an event row (outbox table) in one local transaction. A relay
publishes events from the outbox table, so no event is lost. Delivery is at-least-once, so consumers
must tolerate duplicates.
• Saga Pattern: Long-running transactions with compensating actions (see Q13).
• Event Sourcing: Store state as a sequence of events — truth is the event log, not current state.
• Idempotent consumers: Consumers deduplicate events using event IDs.
• Reconciliation jobs: Periodic checks to detect and fix inconsistencies (see Q19).

// Outbox Pattern
@Transactional
public void createOrder(Order order) {
[Link](order); // Step 1: save to orders table
// Step 2: save event to outbox table IN SAME TRANSACTION
[Link]([Link]()
.aggregateId([Link]())
.eventType("ORDER_CREATED")
.payload(toJson(order))
.status("PENDING")
.build());
// If anything fails, both roll back together
}

// Outbox relay (separate scheduled job)


@Scheduled(fixedDelay=100) // every 100ms
public void publishOutboxEvents() {
List<OutboxEvent> pending = [Link]("PENDING");
[Link](e -> {
[Link]([Link](), [Link]());
[Link]("PUBLISHED");
[Link](e);
});
}

Q21. What happens when payment is successful but booking/order fails? How
do you handle it?

This is the classic distributed transaction problem. Payment succeeded but the booking step failed, so the
customer was charged but no booking was made. The handling depends on whether you use an Orchestration
or a Choreography Saga; the flow below is choreographed (each service reacts to the other's events).

Handling strategy:

Payment Service → charges customer → publishes PaymentSuccess event

■ Booking Service consumes event → tries to create booking → FAILS (DB down)

■ Booking Service retries 3 times (exponential backoff) → still fails

■ Booking Service publishes BookingFailed event

■ Payment Service consumes BookingFailed → initiates REFUND (compensating transaction)

■ Refund is processed → customer gets money back

■ Notification Service sends 'Sorry, booking failed, refund initiated' email

// Compensation: Payment Service listens for booking failure
@KafkaListener(topics="booking-events")
public void onBookingEvent(BookingEvent event) {
if ([Link]().equals("FAILED")) {
Payment payment = paymentRepo
.findByTransactionId([Link]());
if (payment != null && [Link]().equals("SUCCESS")) {
[Link](payment); // compensating transaction
[Link]("REFUNDED");
[Link](payment);
}
}
}

Industry Use: IRCTC and MakeMyTrip handle this pattern daily: the payment gateway confirms the deduction
but seat allotment fails, and an automatic refund follows, typically within 5-7 business days.

Q22. How does the booking service get triggered after payment? What if it's
down?

After payment success, the booking service is triggered via an asynchronous event (Kafka/SQS). If the
booking service is down, the event stays in Kafka's durable log — Kafka retains messages even if
consumers are offline.

Payment Service publishes '[Link]' to Kafka topic

■ Booking Service is DOWN — Kafka holds the message in the partition

■ Booking Service restarts after 10 minutes

■ Booking Service reconnects, resumes from last committed offset

■ Processes '[Link]' event → creates booking

■ Commits offset → message acknowledged

Note: Kafka retains messages for a configurable period (default 7 days), so an event survives a consumer outage
as long as the outage is shorter than the retention window. The consumer group's committed offset records
exactly where processing left off.

Industry Use: This 'at-least-once delivery' guarantee from Kafka is why idempotency (Q16) is critical —
the booking service must check if this payment was already booked before creating a duplicate booking.
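
The duplicate check described above can be sketched as follows. This is a minimal in-memory illustration, assuming the paymentId is the deduplication key; IdempotentBookingHandler and its method names are hypothetical, and a production version would rely on a database table with a unique constraint on paymentId instead of a map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical booking handler that tolerates redelivery of the same payment event.
final class IdempotentBookingHandler {
    // paymentId -> bookingId; stands in for a table with a unique constraint on paymentId
    private final Map<String, Long> bookingsByPayment = new ConcurrentHashMap<>();
    private final AtomicLong nextBookingId = new AtomicLong(1);

    // Returns the booking id. A redelivered event gets the id created the first time.
    long onPaymentCompleted(String paymentId) {
        return bookingsByPayment.computeIfAbsent(paymentId, id -> nextBookingId.getAndIncrement());
    }

    int bookingCount() {
        return bookingsByPayment.size();
    }
}
```

Calling onPaymentCompleted twice with the same paymentId returns the same booking id and leaves bookingCount unchanged, which is the behaviour at-least-once delivery requires.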

Q23. How do you handle refunds in a distributed system?

Trigger: BookingCancelled event OR manual refund request

■ Refund Service receives request

■ Check: Is payment eligible for refund? (within window, not already refunded)

■ Call Payment Gateway API to initiate refund

■ Save refund record with status INITIATED

■ Payment Gateway sends webhook callback: refund SUCCESS/FAILED

■ Update refund status, publish RefundCompleted event

■ Notification Service: email user 'Your refund of Rs X is processed'

• Idempotency: Use the original paymentId as idempotency key for refund — prevents double refund
• Retry: If gateway webhook doesn't arrive in 24h, poll gateway API
• Reconciliation: Nightly job compares refund DB with gateway reports
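
The idempotency rule above (one refund per original paymentId) can be sketched like this. It is an in-memory illustration only: RefundLedger, its Status enum, and the gatewayCalls counter are hypothetical, and the choice to allow a new attempt after a FAILED refund is an assumption, not something the flow above specifies.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical refund ledger keyed by the original paymentId (the idempotency key).
final class RefundLedger {
    enum Status { INITIATED, SUCCESS, FAILED }

    private final Map<String, Status> refunds = new HashMap<>();
    private int gatewayCalls = 0; // counts how often the payment gateway would have been called

    // Starts a refund once per paymentId; repeated requests return the existing status.
    synchronized Status requestRefund(String paymentId) {
        Status existing = refunds.get(paymentId);
        if (existing != null && existing != Status.FAILED) {
            return existing;              // duplicate request: no second gateway call
        }
        gatewayCalls++;                   // stand-in for the gateway's refund API call
        refunds.put(paymentId, Status.INITIATED);
        return Status.INITIATED;
    }

    // Applies the gateway's webhook result; repeated or late webhooks cannot undo a final state.
    synchronized void onGatewayCallback(String paymentId, boolean succeeded) {
        if (refunds.get(paymentId) == Status.INITIATED) {
            refunds.put(paymentId, succeeded ? Status.SUCCESS : Status.FAILED);
        }
    }

    synchronized int gatewayCalls() {
        return gatewayCalls;
    }
}
```

Two refund requests for the same payment result in a single gateway call, and a duplicate webhook after SUCCESS is ignored.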

Q24. How do you prevent duplicate ticket booking?

Duplicate booking can happen due to: user clicking twice, network retry, or message redelivery. Multiple
layers of protection:

• Layer 1 - Idempotency key: Client sends a unique bookingRequestId; server deduplicates
• Layer 2 - Database unique constraint: UNIQUE(userId, eventId, seatId) blocks the same user's duplicate at
DB level; UNIQUE(eventId, seatId) also stops two different users from taking the same seat
• Layer 3 - Distributed lock: Redis SETNX lock on seatId before booking; only one process books a
seat
• Layer 4 - Idempotent consumer: Kafka consumer checks if bookingRequestId already exists
before processing

public BookingResult bookSeat(BookingRequest req) {


String lockKey = "seat:lock:" + [Link]();

// Acquire Redis lock (10s TTL — auto-releases if server dies)


boolean locked = [Link]()
.setIfAbsent(lockKey, [Link](), [Link](10));

if (!locked) {
throw new SeatAlreadyBeingBookedException([Link]());
}

try {
// Check DB unique constraint + idempotency
if ([Link]([Link](),
[Link]())) {
return BookingResult.ALREADY_BOOKED;
}
Booking b = [Link](new Booking(req));
return [Link](b);
} finally {
[Link](lockKey); // always release lock
}
}

Q25. How do you ensure system reliability in distributed failures?

System reliability in distributed systems requires multiple defensive patterns working together:

Pattern           What it protects against
Circuit Breaker   Cascading failure when downstream is slow/down
Retry + Backoff   Transient network failures
Bulkhead          Thread pool exhaustion from one service
Timeout           Slow dependencies blocking threads indefinitely
Fallback          Degraded but functional response when service is down
DLQ               Message processing failures blocking queue
Health checks     Traffic to unhealthy instances
Rate limiting     Overload from too many requests

Q26. What is event-driven architecture? What is Kafka's role?

Event-Driven Architecture (EDA) is a design where services communicate by producing and consuming
events (immutable records of what happened). Services are decoupled — producers don't know who
consumes their events.

Kafka's role in EDA:


• Message broker: Durable, distributed log — stores events persistently
• Pub/Sub: Multiple consumers can subscribe to same topic independently
• Replay: Consumers can replay events from any offset — great for rebuilding state
• High throughput: Handles millions of events/second at ms latency
• Partitioning: Events partitioned by key (e.g., orderId) — ensures ordering per key

Order Service → publishes to 'order-events' Kafka topic

→ Inventory Service (consumer group A) updates stock

→ Analytics Service (consumer group B) updates dashboards

→ Notification Service (consumer group C) sends email

All 3 consumers process independently — Order Service doesn't care

// Producer
@Service
public class OrderService {
@Autowired KafkaTemplate<String, String> kafka;

public void placeOrder(Order order) {


[Link](order);
[Link]("order-events",
[Link]().toString(), // partition key
[Link](order));
}
}

// Consumer
@Component
public class InventoryConsumer {
@KafkaListener(topics="order-events",
groupId="inventory-service")
public void consume(ConsumerRecord<String, String> record) {
Order order = [Link]([Link](), [Link]);
[Link](order);
}
}

Q27. How do services communicate after payment completion?

Payment Service processes payment → SUCCESS

■ Payment Service publishes to Kafka: 'payment-events' topic

Event: {paymentId, orderId, userId, amount, status: 'SUCCESS', timestamp}

■ Order Service (consumer): updates order status to 'PAID'

■ Inventory Service (consumer): decrements available stock

■ Fulfillment Service (consumer): creates pick-pack-ship task

■ Notification Service (consumer): sends 'Payment confirmed' email/SMS

■ Analytics Service (consumer): records revenue event

Note: This is the fan-out pattern — one event triggers multiple independent consumers. Each consumer group
maintains its own offset — completely independent processing.
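
The independent offsets behind this fan-out can be sketched with an in-memory log. This is a simplified model of a single-partition topic, not Kafka's client API: TopicSketch, publish, and poll are hypothetical names, and the offset is committed automatically on every poll.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-partition topic: an append-only log plus one committed offset per group.
final class TopicSketch {
    private final List<String> log = new ArrayList<>();
    private final Map<String, Integer> committedOffsets = new HashMap<>();

    void publish(String event) {
        log.add(event);
    }

    // Returns every event the group has not consumed yet and commits the new offset.
    List<String> poll(String groupId) {
        int from = committedOffsets.getOrDefault(groupId, 0);
        List<String> batch = new ArrayList<>(log.subList(from, log.size()));
        committedOffsets.put(groupId, log.size());
        return batch;
    }
}
```

A group that was offline while events were published still receives all of them on its next poll, because its offset did not move while other groups were consuming.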

Q28. What is Feign Client and how does it work?

Feign (OpenFeign) is a declarative HTTP client, integrated into Spring through Spring Cloud OpenFeign.
Instead of writing RestTemplate boilerplate, you define an interface and Feign generates the implementation.
It integrates with Eureka (service discovery) and Ribbon or Spring Cloud LoadBalancer, and can be combined
with Circuit Breaker and Retry once those are configured.

// 1. Enable Feign
@SpringBootApplication
@EnableFeignClients
public class OrderService { }

// 2. Define Feign client interface


@FeignClient(
name = "payment-service", // resolves via Eureka
fallback = PaymentFallback.class
)
public interface PaymentClient {
@PostMapping("/api/v1/payments")
PaymentResponse charge(@RequestBody PaymentRequest request);

@GetMapping("/api/v1/payments/{id}")
PaymentResponse getPayment(@PathVariable Long id);
}

// 3. Fallback class
@Component
public class PaymentFallback implements PaymentClient {
public PaymentResponse charge(PaymentRequest req) {
return [Link]("SERVICE_UNAVAILABLE");
}
public PaymentResponse getPayment(Long id) {
return [Link]();
}
}

// 4. Use it — just like a local service call


@Service
public class CheckoutService {
@Autowired PaymentClient paymentClient;

public void checkout(Cart cart) {


PaymentResponse response = [Link](
[Link]([Link]()));
}
}

Industry Use: Feign is the standard for sync inter-service calls in Spring Cloud microservices. It handles
JSON serialization, load balancing, and circuit breaking transparently.

Q29. What happens if a dependent service is down?

Defence-in-depth approach — multiple layers to handle downstream service failures:

Timeout: Set connection timeout (1s) + read timeout (3s) on Feign client

■ If timeout hit → Retry (up to 3 times with exponential backoff)

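■ Retries exhausted → the call counts as a failure; when the failure rate crosses the threshold (50% in the config below) the Circuit Breaker OPENS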

■ Future calls FAIL FAST (no waiting) → Fallback triggered

■ Fallback: return cached data / default response / queue for async retry

■ Circuit stays OPEN for 30s → HALF-OPEN → test call → back to CLOSED

# application.yml: Feign timeouts + Resilience4j
feign.client.config.default:
  connectTimeout: 1000 # 1 second
  readTimeout: 3000 # 3 seconds

resilience4j.circuitbreaker.instances.payment-service:
  failureRateThreshold: 50
  waitDurationInOpenState: 30s

resilience4j.retry.instances.payment-service:
  maxAttempts: 3
  waitDuration: 500ms
  enableExponentialBackoff: true
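
The CLOSED → OPEN → HALF-OPEN → CLOSED cycle described above can be sketched as a small state machine. This is a simplified illustration, not Resilience4j's implementation: CircuitBreakerSketch is a hypothetical class, it trips on a run of consecutive failures instead of a failure-rate window, and the clock is injected so the wait period can be simulated.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Simplified circuit breaker: trips after consecutive failures (assumed rule),
// then lets one trial call through once the open period has elapsed.
final class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injected so tests can control time
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    CircuitBreakerSketch(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN; // wait elapsed: allow a test call
        }
        return state;
    }

    // Holding the lock during the call keeps the sketch simple but serialises callers.
    synchronized <T> T call(Supplier<T> remoteCall, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get(); // fail fast: the remote service is not called
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;
            state = State.CLOSED;  // success (including the trial call) closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
                state = State.OPEN;
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }
}
```

While the breaker is OPEN, callers get the fallback immediately and the slow dependency receives no traffic, which is what gives it room to recover.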

Q30. If one service is slow, how does it affect others (cascading failure)?

A cascading failure occurs when a slow/failing service exhausts the thread pool of its caller, which then
becomes slow, which exhausts ITS caller's thread pool — eventually the entire system freezes.

Payment Service is slow (taking 10s per call)

■ Order Service makes 100 concurrent calls to Payment Service

■ All 100 threads in Order Service are BLOCKED waiting for Payment

■ Order Service's thread pool exhausted → can't handle new requests

■ API Gateway calls Order Service → timeout → cascades up

■ All users see 'Service Unavailable' — entire app appears down

Prevention:
• Bulkhead pattern: Separate thread pools per downstream service — Payment slowness only affects
the Payment thread pool, not the entire Order Service
• Timeout: Never block longer than 3s on any downstream call
• Circuit breaker: Once failures hit threshold, stop calling the slow service
• Semaphore isolation: Limit concurrent calls to a specific service

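# Semaphore bulkhead: cap concurrent calls per downstream service
resilience4j.bulkhead.instances.payment-service:
  maxConcurrentCalls: 10 # max 10 concurrent calls to payment
  maxWaitDuration: 100ms # wait max 100ms for a slot

# Thread-pool bulkhead: separate thread pool per downstream service
resilience4j.thread-pool-bulkhead.instances.payment-service:
  maxThreadPoolSize: 10
  coreThreadPoolSize: 5
  queueCapacity: 20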
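
Semaphore isolation, as listed under Prevention, can be sketched in a few lines. This is a minimal illustration built on java.util.concurrent.Semaphore, not Resilience4j's Bulkhead class; BulkheadSketch and its parameter names are hypothetical, chosen to mirror maxConcurrentCalls and maxWaitDuration.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical semaphore bulkhead: at most maxConcurrentCalls calls run at once.
final class BulkheadSketch {
    private final Semaphore permits;
    private final long maxWaitMillis;

    BulkheadSketch(int maxConcurrentCalls, long maxWaitMillis) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.maxWaitMillis = maxWaitMillis;
    }

    <T> T call(Supplier<T> remoteCall, Supplier<T> rejected) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return rejected.get();
        }
        if (!acquired) {
            return rejected.get(); // bulkhead full: shed the call instead of blocking a thread
        }
        try {
            return remoteCall.get();
        } finally {
            permits.release();     // always free the slot, even if the call throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Because a caller waits at most maxWaitMillis for a slot, a slow Payment Service can tie up only the configured number of threads in the Order Service.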

Q31. What is a load balancer? How does it work internally?

A load balancer distributes incoming requests across multiple instances of a service to prevent any single
instance from being overwhelmed. It also performs health checks and removes unhealthy instances from
rotation.

Types of load balancers:


• Layer 4 (Transport): Routes by IP+Port (TCP/UDP). Fast but no visibility into HTTP content.
Example: AWS NLB
• Layer 7 (Application): Routes by URL path, headers, cookies. Can do sticky sessions, A/B testing.
Example: AWS ALB, Nginx

Load balancing algorithms:

Algorithm           Description                                       Use case
Round Robin         Requests distributed evenly in rotation           Homogeneous instances
Least Connections   Send to instance with fewest active connections   Variable request duration
Weighted            Instances get traffic proportional to weight      Canary deployments
IP Hash             Same client IP always goes to same instance       Sticky sessions
Random              Random instance selection                         Simple stateless services
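
Three of the algorithms in the table can be sketched as instance-selection functions. This is an illustrative in-memory version, assuming a fixed list of instance names; LoadBalancerSketch and its method names are hypothetical, and a real load balancer would also track health and update connection counts itself.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical selector showing round robin, IP hash, and least connections.
final class LoadBalancerSketch {
    private final List<String> instances;
    private final AtomicInteger counter = new AtomicInteger();

    LoadBalancerSketch(List<String> instances) {
        this.instances = instances;
    }

    // Round robin: walk the instance list in order, wrapping around at the end.
    String roundRobin() {
        int index = Math.floorMod(counter.getAndIncrement(), instances.size());
        return instances.get(index);
    }

    // IP hash: the same client IP maps to the same instance while the list is unchanged.
    String ipHash(String clientIp) {
        return instances.get(Math.floorMod(clientIp.hashCode(), instances.size()));
    }

    // Least connections: pick the instance with the fewest active connections.
    String leastConnections(Map<String, Integer> activeConnections) {
        String best = instances.get(0);
        for (String instance : instances) {
            if (activeConnections.getOrDefault(instance, 0)
                    < activeConnections.getOrDefault(best, 0)) {
                best = instance;
            }
        }
        return best;
    }
}
```

Note that plain IP hash remaps many clients when an instance is added or removed; consistent hashing is the usual refinement when that matters.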

Q32. How do you implement horizontal scaling?

Horizontal scaling means adding more instances of a service (scale out) instead of making one instance
bigger (scale up). In Kubernetes, this is done via Horizontal Pod Autoscaler (HPA).

# Kubernetes HPA — auto-scale order-service pods
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: order-service-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: order-service
minReplicas: 2
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70 # scale up when CPU > 70%

# Service must be STATELESS for horizontal scaling


# Store sessions in Redis, not in-memory
# No local file system — use S3

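Industry Use: During flash sales (such as Flipkart Big Billion Days), an Order Service can scale from 5 pods
to 50 pods automatically based on CPU/request-rate metrics. New pods still need time to start and pass
readiness checks, so teams usually pre-scale ahead of a known spike.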
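
The replica count the HPA aims for follows the formula in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below applies that formula and clamps the result to minReplicas/maxReplicas. HpaMath is a hypothetical helper, and it deliberately ignores the controller's tolerance band and stabilization window.

```java
// Hypothetical helper applying the documented HPA scaling formula.
final class HpaMath {
    static int desiredReplicas(int currentReplicas, double currentUtilization,
                               double targetUtilization, int minReplicas, int maxReplicas) {
        int desired = (int) Math.ceil(currentReplicas * (currentUtilization / targetUtilization));
        // Clamp to the bounds declared in the HPA spec.
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }
}
```

With the manifest above (target 70%, min 2, max 20), 4 pods running at 90% CPU give ceil(4 × 90 / 70) = 6 pods, while 10 pods at 210% would ask for 30 and be capped at 20.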

Q33. What is the difference between vertical and horizontal scaling?

Vertical Scaling (Scale Up)                 Horizontal Scaling (Scale Out)

• Add more CPU/RAM to same machine          • Add more instances/pods
• Simple: no code changes                   • Requires stateless design
• Has hardware limits (max machine size)    • No theoretical limit
• Single point of failure                   • High availability (if one dies, others serve)
• Downtime during upgrade                   • No downtime
• Example: 4 core → 16 core VM              • Example: 2 pods → 20 pods

Vertical vs Horizontal Scaling

Q34. How do you scale the database in high-load systems?

Read replicas (most common):


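• Primary DB handles all writes; replicas serve reads that can tolerate replication lag
• Add as many replicas as needed for read throughput
• Spring: mark read paths with @Transactional(readOnly=true) and route them to a replica through a routing
DataSource (e.g. AbstractRoutingDataSource); the annotation alone does not switch databases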

Connection pooling (HikariCP):

# application.yml
spring.datasource.hikari.maximum-pool-size: 50
spring.datasource.hikari.minimum-idle: 10
spring.datasource.hikari.connection-timeout: 30000

Caching (Redis):
• Cache frequently-read data (product catalog, user profiles) so that most reads (often 95% or more for hot data) never hit the DB

Sharding:
• Partition data horizontally (e.g., users 1-1M on shard-1, 1M-2M on shard-2)

CQRS:
• Separate write model (normalized DB) from read model (denormalized, search-optimized)

Q35. How do you manage auto-scaling during traffic spikes?

• Kubernetes HPA: Scale pods based on CPU, memory, or custom metrics (requests/second via
Prometheus)
• KEDA: Kubernetes Event-Driven Autoscaler — scale based on Kafka consumer lag, queue depth
• Predictive scaling: Pre-scale before known events (Diwali sale, IPL match), e.g. by raising the HPA's minReplicas on a schedule or using KEDA's cron scaler
• Rate limiting: Protect backend from overload using API Gateway rate limiting
• Queue buffering: Accept requests into Kafka/SQS queue, process at sustainable rate

# KEDA — scale based on Kafka consumer lag


apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: order-consumer-scaler
spec:
scaleTargetRef:
name: order-consumer
triggers:
- type: kafka
metadata:
topic: order-events
lagThreshold: "100" # scale up if lag > 100 messages
bootstrapServers: kafka:9092
consumerGroup: order-service

Q36. How do you design a cloud-native application?

A cloud-native application is designed to run in cloud environments with full advantage of cloud
capabilities. The Twelve-Factor App methodology guides cloud-native design:

Factor Practice

1. Codebase One codebase in version control, many deploys

2. Dependencies Explicitly declare ([Link], [Link])

3. Config Store in environment variables, not code

4. Backing services DB, Redis as attached resources (swap without code change)

5. Build/Release/Run Strictly separate build, release, run stages

6. Processes Stateless processes — no sticky sessions

7. Port binding Export services via port (embedded server)

8. Concurrency Scale by adding processes (horizontal)

9. Disposability Fast startup/shutdown — pods spin up in seconds

10. Dev/Prod parity Dev, staging, prod as similar as possible

11. Logs Write to stdout — platform collects and routes

12. Admin processes Run admin tasks as one-off processes

Q37. What is Cloud Native?

Cloud Native is an approach to building and running applications that fully exploits the advantages of cloud
computing (elasticity, distributed services, managed infrastructure). Core pillars:

• Containers: Package app + dependencies (Docker) — run anywhere


• Microservices: Decomposed services — deploy/scale independently
• DevOps: CI/CD pipelines — deploy multiple times per day
• Continuous Delivery: Always in a deployable state
• Dynamic orchestration: Kubernetes manages container lifecycle
• Observability: Metrics, logs, traces built-in (not an afterthought)

Q38. How do you ensure fault tolerance in microservices?

Fault tolerance means the system continues functioning (possibly in degraded mode) even when some
components fail. Key strategies:

• Redundancy: Run multiple instances of every service — no single point of failure


• Circuit breaker: Stop calling failing services fast (Resilience4j)
• Retry + idempotency: Safely retry transient failures
• Graceful degradation: Return partial/cached response when dependency is down
• Health checks + auto-restart: Kubernetes liveness/readiness probes restart unhealthy pods
• Multi-AZ deployment: Deploy in multiple availability zones — if one datacenter fails, others serve
traffic
• Chaos engineering: Deliberately inject failures (Netflix Chaos Monkey) to find weaknesses

Q39. How do you ensure scalability in microservices?

• Stateless services: Store session in Redis, not in-process — enables horizontal scaling

• Async processing: Kafka queues absorb traffic spikes — process at sustainable rate
• Caching: Redis for hot data — prevents DB from being bottleneck
• Database read replicas: Scale reads independently from writes
• CDN: Serve static assets from edge nodes globally
• Kubernetes HPA: Auto-scale pods based on demand
• CQRS: Separate read/write models — each scales independently
• API Gateway rate limiting: Protect backend from being overwhelmed
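
Rate limiting at the gateway is commonly built on a token bucket. The following is a minimal single-node sketch of that algorithm, not the implementation of any particular gateway: TokenBucket and its constructor parameters are hypothetical, and the clock is injected so refills can be simulated.

```java
import java.util.function.LongSupplier;

// Hypothetical token bucket: allows bursts up to capacity, then a steady refill rate.
final class TokenBucket {
    private final long capacity;
    private final double refillPerMilli;
    private final LongSupplier clockMillis; // injected so tests can control time
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, double refillPerSecond, LongSupplier clockMillis) {
        this.capacity = capacity;
        this.refillPerMilli = refillPerSecond / 1000.0;
        this.clockMillis = clockMillis;
        this.tokens = capacity;             // start full so an initial burst is allowed
        this.lastRefill = clockMillis.getAsLong();
    }

    synchronized boolean tryAcquire() {
        long now = clockMillis.getAsLong();
        // Add tokens for the elapsed time, never exceeding the bucket capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerMilli);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false; // caller would typically answer with HTTP 429
    }
}
```

With several gateway instances, the bucket state would have to live in a shared store such as Redis so that all instances enforce the same limit.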

Q40. How do you handle high traffic (100 to 1000 users)?

Traffic scaling strategy progresses through layers:

100 users: Single instance + DB connection pooling (HikariCP)

■ 500 users: Add Redis cache for hot data, add DB read replica

■ 1,000 users: Horizontal scaling (2-3 pods), load balancer

■ 10,000 users: Full horizontal scaling + CDN + async Kafka processing

■ 100,000 users: Multi-region + DB sharding + dedicated caching cluster

■ 1,000,000 users: Global CDN + Event sourcing + Cassandra/DynamoDB

Industry Use: Don't over-engineer early. Uber, Instagram, Airbnb all started on a monolith. Scale
infrastructure as actual bottlenecks appear — profiling identifies where the actual limit is.

Q41. How do you monitor and debug production issues in microservices?

The three pillars of observability:


• Metrics (Prometheus + Grafana): Quantitative — CPU, memory, request rate, error rate, latency p99
• Logs (ELK / Splunk): Textual events — correlate by traceId/transactionId
• Traces (Jaeger / Zipkin): End-to-end request flow across services with timing

Production debugging workflow:

1. Alert fires: 'Error rate on Order Service > 5% for 5 minutes'

■ 2. Open Grafana: identify which endpoint is failing, from what time

■ 3. Open Jaeger: find traces with errors → see which service in chain failed

■ 4. Open Kibana: filter logs by traceId → read detailed error messages

■ 5. Check recent deployments: was new version deployed recently?

■ 6. Check downstream services: is payment/inventory service healthy?

■ 7. Fix + deploy → monitor metrics return to normal

Q42. How do you debug a microservice when it becomes slow?

Step-by-step debugging:
• Step 1 - Metrics: Check CPU/memory in Grafana — is it resource-constrained?
• Step 2 - Latency breakdown: In Jaeger traces, which span is slow? DB call? Downstream service?
• Step 3 - DB queries: Enable slow query log — is there a missing index?
• Step 4 - Thread dump: /actuator/threaddump — are threads blocked on locks?
• Step 5 - GC pressure: /actuator/metrics/jvm.gc.pause (excessive GC pauses?)
• Step 6 - Connection pool: Check Hikari pool metrics — pool exhausted?
• Step 7 - Kafka lag: If consumer, check consumer group lag — is it falling behind?

# Check actuator metrics for slow endpoints
GET /actuator/metrics/http.server.requests
?tag=uri:/api/v1/orders&tag=status:200

# Get p99 latency:
GET /actuator/metrics/http.server.requests
?tag=quantile:0.99

# Thread dump — look for BLOCKED threads:


GET /actuator/threaddump

Q43. How do you identify bottlenecks in microservices?

• Distributed tracing (Jaeger): See which service/call contributes most to total latency
• Profiling (async-profiler, JFR): Find hot methods in JVM — CPU flame graphs
• DB slow query log: Queries taking > 100ms — add indexes or optimize
• Kafka consumer lag: Growing lag = consumer is bottleneck. Add consumers (up to the partition count; extra consumers sit idle) or add partitions
• CPU/memory metrics: Grafana shows resource saturation
• Queue depths: Long queues indicate downstream processing bottlenecks
• Load testing (Gatling/JMeter): Find breaking point before production

Q44. What is the on-call / incident management process?

Incident management is the structured process for detecting, responding to, and resolving production
outages:

1. DETECT: Alert fires (PagerDuty/OpsGenie) → on-call engineer paged

■ 2. ACKNOWLEDGE: Engineer acknowledges within 5 min (SLA)

■ 3. TRIAGE: Assess severity (P0=site down, P1=major feature down, P2=degraded)

■ 4. COMMUNICATE: Post in incident Slack channel — 'Investigating login failures'

■ 5. INVESTIGATE: Check dashboards, traces, logs — identify root cause

■ 6. MITIGATE: Rollback deploy / scale up / enable feature flag / restart pods

■ 7. RESOLVE: Confirm metrics normal — update status page

■ 8. POST-MORTEM: Blameless RCA document — what broke, why, how to prevent

Industry Use: Google's SRE book defines SLOs (Service Level Objectives) and error budgets. For example,
with a 99.9% success SLO, an error rate above 0.1% burns the budget and can trigger an incident. Major
incidents end with a blameless post-mortem; some providers publish theirs (see AWS outage reports).
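
The error-budget arithmetic is simple enough to sketch. ErrorBudget below is a hypothetical helper, and the 30-day window and request counts in the note that follows are example figures, not values from any specific SLO policy.

```java
// Hypothetical helper for error-budget arithmetic.
final class ErrorBudget {
    // Minutes of full downtime allowed in the window while still meeting the SLO.
    static double allowedDowntimeMinutes(double sloPercent, int windowDays) {
        double windowMinutes = windowDays * 24 * 60;
        return windowMinutes * (100.0 - sloPercent) / 100.0;
    }

    // Fraction of the budget still unspent (negative once the SLO is violated).
    static double remainingFraction(long totalRequests, long failedRequests, double sloPercent) {
        double allowedFailures = totalRequests * (100.0 - sloPercent) / 100.0;
        return 1.0 - failedRequests / allowedFailures;
    }
}
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime. In request terms, 1,000,000 requests allow roughly 1,000 failures, so 250 failures leave about 75% of the budget.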

Q45. How is work divided across teams? Code reviews and releases?

Team ownership (Domain-Driven Design):


• Team per service/domain: Order team owns Order + Cart services; Payment team owns Payment +
Refund
• You build it, you run it: Team owns deployment, monitoring, on-call for their service
• API contracts first: Teams define OpenAPI specs before implementation — enables parallel work

Code review process:


• Branch strategy: feature branches → PR → review → merge to main
• Minimum 2 approvals for production code; 1 for hotfixes
• Automated checks: unit tests, linting, Sonar quality gate must pass before merge

Release process:

Dev merges feature → CI pipeline runs (tests + build + scan)

■ Auto-deploy to dev environment → smoke tests

■ Manual approval for staging deploy

■ QA + performance testing on staging

■ Manual approval for production (or auto for mature teams)

■ Production deploy → monitor metrics for 30 minutes

■ Rollback if error rate spikes

Q46. Tools for logging, monitoring, tracing (ELK, Prometheus, Grafana, Zipkin,
Sleuth, Jaeger)

Tool            Category                Purpose
Prometheus      Metrics collection      Scrapes /actuator/prometheus, stores time-series data
Grafana         Metrics visualisation   Dashboards + alerts on Prometheus data
ELK Stack       Centralised logging     Elasticsearch+Logstash+Kibana: store, search, visualise logs
Spring Sleuth   Trace ID propagation    Auto-adds traceId/spanId to every log line and HTTP header
Zipkin/Jaeger   Distributed tracing     Visualise end-to-end request traces across services
Splunk          Enterprise logging      Commercial alternative to ELK: powerful search + alerting
Micrometer      Metrics facade          Vendor-neutral metrics API: sends to Prometheus/Datadog/New Relic

<!-- pom.xml: add Sleuth + Zipkin -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-sleuth</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-sleuth-zipkin</artifactId>
</dependency>

# application.yml
spring.sleuth.sampler.probability: 1.0 # trace 100% of requests
spring.zipkin.base-url: [Link]

# Log output automatically includes traceId/spanId:


# INFO [order-service,abc123,def456] Processing order 42
# ^traceId ^spanId

Q47. What is Spring Boot Actuator and how to use it in microservices for
monitoring?

Spring Boot Actuator exposes production-ready endpoints for health, metrics, traces, and runtime info. In
microservices, it integrates with Prometheus/Grafana for metrics and Kubernetes for health probes.

# application.yml
management.endpoints.web.exposure.include: health,info,metrics,prometheus
management.endpoint.health.show-details: always
[Link]: order-service
[Link]: production

# Kubernetes probes:
# livenessProbe: GET /actuator/health/liveness
# readinessProbe: GET /actuator/health/readiness

# Health response example:
# {
#   "status": "UP",
#   "components": {
#     "db": {"status": "UP", "details": {"database": "MySQL"}},
#     "redis": {"status": "UP"},
#     "kafka": {"status": "UP"}
#   }
# }

Q48. What is Micrometer?

Micrometer is the metrics instrumentation library for JVM-based applications — the 'SLF4J for metrics'. It
provides a vendor-neutral API to record metrics (counters, gauges, timers, histograms) and exports them
to any monitoring backend.

@Service
public class OrderService {
private final Counter ordersCreated;
private final Timer orderProcessingTimer;
// No Gauge field here: a final field that the constructor never assigns would not compile.

public OrderService(MeterRegistry registry) {


ordersCreated = [Link]("[Link]")
.tag("status", "success")
.description("Total orders created")
.register(registry);

orderProcessingTimer = [Link]("[Link]")
.description("Time to process an order")
.register(registry);
}

public Order createOrder(OrderRequest req) {


return [Link](() -> {
Order order = [Link](new Order(req));
[Link]();
return order;
});
}
}

Industry Use: Micrometer metrics are exposed at /actuator/prometheus and scraped by Prometheus on a fixed
interval (15s is common). Grafana displays orders/second, p99 latency, and error rates on dashboards.

Q49. How do you handle failures → circuit breakers, retries, fallback?

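The complete resilience pattern stack with Resilience4j, shown here as a conceptual flow. With the annotations used below, Resilience4j's default aspect order (outermost first) is Retry, CircuitBreaker, RateLimiter, TimeLimiter, Bulkhead, so treat the numbering as a checklist and not the exact execution order: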

Request comes in

■ 1. TimeLimiter: timeout if call takes > 3 seconds

■ 2. CircuitBreaker: if OPEN → fail fast immediately (no call made)

■ 3. Retry: if call fails with transient error → retry up to 3 times

■ 4. Bulkhead: limit concurrent calls to protect downstream

■ 5. If all retries fail → Fallback method called

■ Fallback: return cached data, default response, or queue for later

@Service
public class ProductService {

@CircuitBreaker(name="inventory", fallbackMethod="fallback")
@Retry(name="inventory")
@TimeLimiter(name="inventory")
@Bulkhead(name="inventory")
public CompletableFuture<ProductDetail> getProduct(Long id) {
return [Link](() ->
[Link](id));
}

// Fallback — called when all resilience measures exhausted


public CompletableFuture<ProductDetail> fallback(Long id, Exception e) {
[Link]("Inventory unavailable for product {}, using cache", id);
ProductDetail cached = [Link]("product:" + id);
if (cached != null) return [Link](cached);
return [Link]([Link](id));
}
}

Q50. How do you design a system for read-heavy traffic (millions of users)?

Browser/Mobile → CDN (CloudFront) — serves static assets + cached API responses

■ CDN miss → API Gateway (rate limiting, auth)

■ API Gateway → Load Balancer → Service instances (horizontal scaled)

■ Service → L1 Cache (in-process Caffeine): check first (sub-ms)

■ Cache miss → L2 Cache (Redis cluster): check next (1-2ms)

■ Cache miss → Database Read Replica (10-20ms)

■ Writes always go to Primary DB → replicated to replicas asynchronously

@Service
public class ProductCatalogService {

@Cacheable(value="products", key="#id",
unless="#result == null")
public Product getProduct(Long id) {
return [Link](id).orElseThrow();
}

@CacheEvict(value="products", key="#[Link]")
public Product updateProduct(Product product) {
return [Link](product);
}
}

# [Link] — Redis cache TTL


[Link]-to-live: 300s # 5 minute TTL
[Link]-null-values: false

Industry Use: For product catalog pages, 95%+ of traffic can be served from Redis or the CDN, leaving the DB
to handle writes and cold cache misses. Large consumer platforms use this layered-cache design to serve
very high read rates.

Q51. What is the Hot Key Problem in Redis? How to solve it?

The Hot Key problem occurs when one Redis key receives disproportionately high traffic — all requests
are routed to the same Redis shard, creating a bottleneck. Example: product:iphone15 during launch day
— millions of reads per second.

Solutions:
• Local in-process cache (Caffeine): Cache hot keys in JVM memory — no Redis call at all
• Key replication: Store same data under product:iphone15:1, :2, :3 — randomise read key
• Read-through cache on multiple nodes: Use Redis Cluster and explicitly replicate hot keys
• Increase TTL: Hot keys stay cached longer — fewer DB refreshes
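
The key replication idea above can be sketched in plain Java. This is a minimal illustration, not a Redis client: the class name HotKeyReplicas and its methods are hypothetical, and the caller is assumed to write the same value under every replica key and read from one of them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper: spreads one hot key over several replica keys. */
final class HotKeyReplicas {
    private final int replicas;

    HotKeyReplicas(int replicas) {
        if (replicas < 1) throw new IllegalArgumentException("replicas must be >= 1");
        this.replicas = replicas;
    }

    /** All replica keys; a write must update every one of them. */
    List<String> writeKeys(String baseKey) {
        List<String> keys = new ArrayList<>();
        for (int i = 1; i <= replicas; i++) {
            keys.add(baseKey + ":" + i);
        }
        return keys;
    }

    /** One replica key chosen at random, so reads spread across replicas. */
    String readKey(String baseKey) {
        int i = ThreadLocalRandom.current().nextInt(1, replicas + 1);
        return baseKey + ":" + i;
    }
}
```

Because each suffixed key is a different string, a sharded Redis deployment will usually place the copies on different shards. The trade-off is that every write now touches all replicas.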

// Solution: Local cache in front of Redis
@Service
public class ProductService {
// L1: JVM local cache (Caffeine) — extremely fast, no network
private LoadingCache<Long, Product> localCache = [Link]()
.maximumSize(1000) // top 1000 hottest products
.expireAfterWrite(30, [Link])
.build(this::loadFromRedis);

public Product getProduct(Long id) {


return [Link](id); // L1 hit → 0.01ms
}

private Product loadFromRedis(Long id) {


String cached = [Link]("product:" + id);
if (cached != null) return deserialize(cached);
return loadFromDb(id); // L3 fallback
}
}

Q52. What is Sharding and how does it help?

Sharding (horizontal partitioning) splits a large dataset across multiple database instances (shards) based
on a shard key. Each shard holds a subset of data. This scales storage and write throughput beyond a
single machine's limits.

Sharding strategies:
• Range sharding: users 1-1M → shard 1; 1M-2M → shard 2. Simple but can create hot shards.
• Hash sharding: hash(userId) % numShards → uniform distribution, but range queries are hard.
• Directory sharding: A lookup table maps each key to its shard — flexible but lookup table is a
bottleneck.

// Hash sharding example


public int getShardId(Long userId) {
return (int)(userId % NUM_SHARDS); // userId 42 → shard 42 % 4 = 2
}

public UserRepository getRepoForUser(Long userId) {


int shardId = getShardId(userId);
return [Link](shardId);
}

// Consistent hashing (used by Cassandra; Redis Cluster uses a related
// scheme with 16384 fixed hash slots rather than a ring):
// Hash ring: add/remove shards without resharding all data
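
The hash ring mentioned in the comment can be sketched as follows. This is a minimal illustration under stated assumptions: the class ConsistentHashRing, the use of MD5 as the ring hash, and the virtual-node naming scheme are all choices made for the example, not how any particular database implements it.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical consistent hash ring with virtual nodes. */
final class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    /** Places the shard at several points on the ring. */
    void addShard(String shard) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(shard + "#" + i), shard);
        }
    }

    /** Removes only this shard's points; other shards keep their positions. */
    void removeShard(String shard) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(shard + "#" + i));
        }
    }

    /** Walks clockwise from the key's hash to the next shard point. */
    String shardFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no shards");
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    /** First 8 bytes of MD5 as a long (assumed hash for this sketch). */
    private static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Unlike `userId % NUM_SHARDS`, removing a shard here only remaps the keys that lived on that shard; keys on the remaining shards stay where they were.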

Industry Use: Instagram uses sharding by user ID for photos. Cassandra and MongoDB have built-in
sharding. For OLTP, Vitess (YouTube's DB sharding middleware) shards MySQL.

Q53. What is the difference between client-side and server-side service discovery?

Client-Side Discovery (Eureka + Ribbon)          | Server-Side Discovery (Kubernetes / AWS ALB)

• Client queries Eureka registry                 | • Client calls a fixed DNS / load balancer
• Client selects instance (Ribbon load balance)  | • Load balancer queries registry internally
• Client makes direct HTTP call                  | • Load balancer forwards to instance
• Examples: Netflix Eureka, Consul               | • Examples: K8s Service, AWS ALB
• Pros: Less infrastructure                      | • Pros: Simpler clients
• Cons: Each client needs discovery logic        | • Cons: Extra hop, LB is bottleneck

Client-side vs Server-side Service Discovery

Q54. Design a scalable backend for an IoT-enabled ambulance tracking system.

This is a real-time location tracking system with high write throughput (thousands of ambulances updating
location every 5 seconds) and low-latency reads (dispatchers and hospitals viewing live maps).

Architecture:

Ambulance GPS device → MQTT broker (Mosquitto/AWS IoT Core)

■ MQTT broker → Kafka topic 'ambulance-location' (high throughput)

■ Location Service consumes from Kafka:

- Updates Redis GEO (geospatial index) for live positions

- Writes to TimescaleDB (time-series DB) for history/analytics

■ Dispatcher Web App → WebSocket connection to Location Service

- Location Service pushes real-time updates via WebSocket

■ Hospital Dashboard → REST API: GET /ambulances/nearby?lat=x&lng=y&radius=5km

- Serves from Redis GEO (sub-millisecond)
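
Redis GEO performs the radius search server-side. To make the "nearby" check concrete, here is a minimal plain-Java sketch of the great-circle (haversine) distance that such a radius filter is based on. The class GeoDistance and its method names are hypothetical, and a spherical Earth with radius 6371 km is assumed.

```java
/** Hypothetical helper illustrating the distance check behind a radius query. */
final class GeoDistance {
    private static final double EARTH_RADIUS_KM = 6371.0; // assumed mean radius

    /** Great-circle distance in km between two lat/lng points (degrees). */
    static double haversineKm(double lat1, double lng1, double lat2, double lng2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLng = Math.toRadians(lng2 - lng1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLng / 2) * Math.sin(dLng / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** True if the ambulance position lies within radiusKm of the query point. */
    static boolean withinRadius(double queryLat, double queryLng,
                                double ambLat, double ambLng, double radiusKm) {
        return haversineKm(queryLat, queryLng, ambLat, ambLng) <= radiusKm;
    }
}
```

Scanning every ambulance with this check is O(N) per query; the point of a geospatial index such as Redis GEO is to avoid that full scan.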

// Redis GEO for real-time ambulance positions
@Service
public class LocationService {
@KafkaListener(topics="ambulance-location")
public void updateLocation(LocationEvent event) {
// Store in Redis GEO index
[Link]().add(
"ambulances",
new Point([Link](), [Link]()),
[Link]().toString()
);
// Push to WebSocket subscribers
[Link](
"/topic/ambulance/" + [Link](), event);
}

public List<Ambulance> findNearby(double lat, double lng, double km) {


return [Link]().radius(
"ambulances",
new Circle(new Point(lng, lat),
new Distance(km, [Link]))
).getContent().stream()...
}
}

Q55. Design a ticket management system using microservices.

Services:
• User Service: Auth, user profiles
• Event Service: Event catalog, seating charts
• Inventory Service: Seat availability, locking
• Booking Service: Ticket reservations, booking lifecycle
• Payment Service: Payment processing, refunds
• Notification Service: Email/SMS confirmations

Booking flow:

User selects seat → Inventory Service: LOCK seat (Redis, 10-min TTL)

■ User enters payment → Payment Service processes charge

■ Payment SUCCESS → Booking Service: creates confirmed booking

■ Inventory Service: marks seat as SOLD

■ Notification Service: sends ticket via email/SMS

■ If user doesn't pay within 10 min → Redis TTL expires → seat unlocked

Key design decisions:


• Prevent overselling: Redis atomic SETNX for seat lock — only one user holds lock

• Virtual queue: Under high demand, queue users and process sequentially
• Read model: Elasticsearch for event search (full-text, filters)
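
The seat lock with a TTL can be sketched in memory to show the semantics that the Redis SETNX-with-expiry approach relies on. This is an illustration only: the class SeatLocks is hypothetical, a ConcurrentHashMap stands in for Redis, and the current time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for a Redis seat lock with TTL. */
final class SeatLocks {
    private static final class Lock {
        final String userId;
        final long expiresAtMillis;

        Lock(String userId, long expiresAtMillis) {
            this.userId = userId;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<String, Lock> locks = new ConcurrentHashMap<>();
    private final long ttlMillis;

    SeatLocks(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /**
     * Atomically takes the lock if the seat is free or the previous lock expired.
     * Returns true only for the caller whose lock was stored.
     */
    boolean tryLock(String seatId, String userId, long nowMillis) {
        Lock candidate = new Lock(userId, nowMillis + ttlMillis);
        Lock winner = locks.compute(seatId, (seat, current) ->
                (current == null || current.expiresAtMillis <= nowMillis) ? candidate : current);
        return winner == candidate;
    }
}
```

The important property is that check-and-set happens in one atomic step; a separate "is it free?" read followed by a write would let two users lock the same seat.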

Q56. Design a notification system using the Open/Closed Principle.

Open/Closed Principle: open for extension (add new notification channels), closed for modification
(existing channels don't change). Use Strategy pattern.

// Notification channel interface (Open/Closed)


public interface NotificationChannel {
boolean supports(NotificationType type);
void send(Notification notification);
}

// Concrete channels — add new ones without changing existing


@Component
public class EmailChannel implements NotificationChannel {
public boolean supports(NotificationType type) {
return type == EMAIL || type == ALL;
}
public void send(Notification n) { [Link](n); }
}

@Component
public class SMSChannel implements NotificationChannel {
public boolean supports(NotificationType type) {
return type == SMS || type == ALL;
}
public void send(Notification n) { [Link](n); }
}

// Notification dispatcher — delegates to all supporting channels


@Service
public class NotificationService {
@Autowired List<NotificationChannel> channels;

public void send(Notification notification) {


[Link]()
.filter(c -> [Link]([Link]()))
.forEach(c -> [Link](notification)); // Strategy pattern
}
}
// Add WhatsApp: just create WhatsAppChannel — ZERO changes to existing code

Q57. Design a price drop notification system for e-commerce.

User clicks 'Notify me when price drops' → Watch Service saves (userId, productId,
targetPrice)

■ Price Service updates price → publishes PriceChanged event to Kafka

■ Watch Service (Kafka consumer) receives PriceChanged event

■ Query: find all watchers for this product where targetPrice >= newPrice

■ For each matching watcher → publish NotificationRequest to Kafka

■ Notification Service sends email/push notification

■ Remove or deactivate the watch record

@KafkaListener(topics="price-events")
public void onPriceChange(PriceChangedEvent event) {
List<PriceWatch> watchers = watchRepo
.findByProductIdAndTargetPriceGreaterThanEqual(
[Link](), [Link]());

[Link](watcher -> {
[Link]("notifications", [Link]()
.userId([Link]())
.message("Price dropped to Rs " + [Link]())
.channel([Link])
.build());
[Link](false);
[Link](watcher);
});
}

Q58. Design a real-time SMS notification system.

Producer service publishes to Kafka 'sms-requests' topic

■ SMS Worker (consumer) reads from Kafka

■ Deduplication: check Redis if this requestId was already sent (TTL 24h)

■ Template resolution: merge template with data (FreeMarker/Thymeleaf)

■ Provider selection: primary=Twilio, failover=AWS SNS

■ HTTP call to SMS provider API

■ SUCCESS: update DB status=SENT, store in Redis for dedup

■ FAILURE: retry 3x → DLQ for manual review

■ Delivery receipt: provider webhook → update status=DELIVERED

Scalability: Partition Kafka by mobile number (ensures order per recipient). Scale SMS workers
horizontally — each worker handles different partitions.
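
The worker's deduplication, retry, and provider failover steps can be sketched as below. This is a simplified illustration: SmsDispatcher and SmsProvider are hypothetical names, an in-memory set stands in for the Redis dedup store, and trying each provider up to maxAttempts times before failing over is an assumption about how the retries are distributed.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical SMS worker core: dedup, retry, failover. */
final class SmsDispatcher {
    /** Minimal provider abstraction; returns true when the provider accepted the message. */
    interface SmsProvider {
        boolean send(String to, String text);
    }

    private final List<SmsProvider> providers; // primary first, then failover
    private final int maxAttempts;
    private final Set<String> sentRequestIds = ConcurrentHashMap.newKeySet();

    SmsDispatcher(List<SmsProvider> providers, int maxAttempts) {
        this.providers = providers;
        this.maxAttempts = maxAttempts;
    }

    /**
     * Returns true if the message was sent now or earlier (duplicate request).
     * Returns false when every provider failed; the caller would route it to a DLQ.
     */
    boolean dispatch(String requestId, String to, String text) {
        if (sentRequestIds.contains(requestId)) {
            return true; // already sent: skip to avoid a duplicate SMS
        }
        for (SmsProvider provider : providers) {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                if (provider.send(to, text)) {
                    sentRequestIds.add(requestId);
                    return true;
                }
            }
        }
        return false;
    }
}
```

In a real worker the dedup entry would carry a TTL (24h in the flow above) and retries would be spaced out with backoff rather than issued back to back.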

Q59. Design a scalable file upload system for large files (2GB+).

Large file uploads must not go through the application server — this wastes memory and bandwidth. Use
pre-signed URLs to upload directly to cloud storage.

Client requests upload: POST /files/initiate {filename, size, contentType}

■ Upload Service generates S3 pre-signed URL (valid 1 hour)

■ Returns pre-signed URL to client

■ Client uploads directly to S3 using pre-signed URL (no server involved)

■ S3 triggers Lambda/webhook on upload completion

■ Upload Service receives completion callback → updates file metadata in DB

■ Trigger async processing: virus scan, thumbnail generation, transcoding

// Generate pre-signed URL


@GetMapping("/files/upload-url")
public UploadUrlResponse getUploadUrl(
@RequestParam String fileName,
@RequestParam String contentType) {

String key = [Link]() + "/" + fileName;

PresignedPutObjectRequest presigned = [Link](


r -> [Link](p -> p
.bucket("my-uploads")
.key(key)
.contentType(contentType))
.signatureDuration([Link](1)));

return new UploadUrlResponse([Link]().toString(), key);


}

// Client uploads: PUT presignedUrl (with file bytes — no auth headers needed)

Industry Use: Google Drive, Dropbox, and WhatsApp all use pre-signed URL pattern. The app server
only handles metadata — S3/GCS handles the actual bytes.

Q60. Explain Group Anagrams problem and its time complexity.

Group Anagrams: given a list of strings, group strings that are anagrams of each other. This is a common
coding interview problem that tests HashMap and sorting understanding.

// Input: ["eat","tea","tan","ate","nat","bat"]
// Output: [["eat","tea","ate"],["tan","nat"],["bat"]]

public List<List<String>> groupAnagrams(String[] strs) {
    Map<String, List<String>> map = new HashMap<>();

    for (String s : strs) {
        char[] chars = s.toCharArray();
        Arrays.sort(chars);                  // sort chars → canonical form
        String key = new String(chars);      // 'eat','tea','ate' all → 'aet'

        map.computeIfAbsent(key, k -> new ArrayList<>()).add(s);
    }

    return new ArrayList<>(map.values());
}

// Time: O(N * K log K) — N=strings, K=max string length (sort each string)
// Space: O(N * K) — storing all strings in HashMap

// Optimised key: count char frequency instead of sorting


// O(N * K) time — avoid sort
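
The frequency-count variant mentioned in the last comment could look like this. It is a sketch that assumes inputs contain only lowercase letters a to z; the class name AnagramGrouper is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical count-based grouping: no per-string sort. */
final class AnagramGrouper {
    static List<List<String>> groupAnagrams(String[] strs) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String s : strs) {
            // Count each letter; anagrams produce identical count arrays.
            int[] counts = new int[26];
            for (int i = 0; i < s.length(); i++) {
                counts[s.charAt(i) - 'a']++;
            }
            // Encode the counts as a string key, e.g. "1#0#0#0#1#...".
            StringBuilder key = new StringBuilder();
            for (int c : counts) {
                key.append(c).append('#');
            }
            groups.computeIfAbsent(key.toString(), k -> new ArrayList<>()).add(s);
        }
        return new ArrayList<>(groups.values());
    }
}
```

Building the key costs O(K + 26) per string instead of O(K log K), which gives O(N * K) overall for a fixed alphabet. The order of the returned groups is not guaranteed because it comes from a HashMap.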

Q61. What is CQRS? How does data flow in CQRS (read vs write)?

CQRS (Command Query Responsibility Segregation) separates the write model (commands that change
state) from the read model (queries that read state). This allows each model to be independently optimised
and scaled.

WRITE SIDE:

Client → Command (CreateOrder) → Command Handler → Write DB (normalized MySQL)

→ publishes DomainEvent (OrderCreated) to Kafka

READ SIDE:

Event Handler consumes OrderCreated → updates Read DB (denormalized, Elasticsearch)

Client → Query → Query Handler → Read DB → fast response

// Command (write)
@Service
public class OrderCommandHandler {
@Transactional
public String handle(CreateOrderCommand cmd) {
Order order = new Order(cmd);
[Link](order); // normalised write DB
[Link](new OrderCreatedEvent(order));
return [Link]();
}
}

// Read model projector


@Component
public class OrderProjector {
@EventListener
public void on(OrderCreatedEvent event) {
// Build denormalized view for fast reads
OrderView view = [Link]()
.id([Link]())
.userEmail([Link]([Link]()))
.productNames([Link]().stream()...)
.build();
[Link](view); // Elasticsearch / read replica
}
}

// Query (read)
@Service
public class OrderQueryHandler {
public List<OrderView> getOrdersForUser(Long userId) {
return [Link](userId); // fast, no joins
}
}

Industry Use: CQRS with Event Sourcing is used in financial systems (Axon Framework), e-commerce
dashboards, and any system with very different read/write patterns.

Q62. What are other approaches besides Redis and CDN for high traffic?

• Read replicas: Route read queries to replica DBs to scale reads horizontally (limited by replication lag and the primary's write capacity)
• Database query optimisation: Indexes, query plan analysis, N+1 elimination
• Elasticsearch: Dedicated search engine for full-text and complex queries
• Kafka: Async processing buffers traffic spikes — system accepts load and processes at its own pace
• gRPC: Typically faster than JSON-over-HTTP REST for inter-service calls (binary Protobuf encoding, HTTP/2 multiplexing); the actual gain depends on payload and workload
• Connection pooling (PgBouncer): Reduces DB connection overhead
• Denormalization: Pre-join data in read tables — avoid expensive JOINs at query time
• Pagination + lazy loading: Never return unbounded result sets

Q63. How to implement a custom Spring Security filter?

Custom filters extend OncePerRequestFilter to ensure they run exactly once per request. Common use:
JWT validation, API key authentication, request logging.

@Component
public class ApiKeyFilter extends OncePerRequestFilter {

@Override
protected void doFilterInternal(
HttpServletRequest request,
HttpServletResponse response,
FilterChain chain) throws ... {

String apiKey = [Link]("X-API-Key");

if (apiKey == null || ![Link](apiKey)) {


[Link]([Link]());
[Link]().write("{\"error\": \"Invalid API key\"}");
return; // don't continue chain
}

// Set authentication in security context


ApiKeyAuthentication auth =
new ApiKeyAuthentication([Link](apiKey));
[Link]().setAuthentication(auth);

[Link](request, response); // continue


}
}

// Register in SecurityFilterChain
@Bean
public SecurityFilterChain filterChain(HttpSecurity http) throws Exception {
[Link](apiKeyFilter,
[Link]);
return [Link]();
}

Q64. What is P2P communication between microservices? What are its problems?

P2P (point-to-point / direct) communication means Service A directly calls Service B over HTTP/gRPC
without going through any intermediary (no message broker). This is the default with Feign Client.

Problems with P2P:


• Tight coupling: Service A must know Service B's address/port
• Cascading failures: If B is down, A fails immediately
• Chatty communication: Too many small P2P calls create network overhead
• Hard to trace: No central point to observe all communications
• Security: Every service must authenticate every other service

Solution: Service Mesh (Istio/Linkerd) or API Gateway for cross-cutting concerns:


• Service mesh handles: mTLS, retries, circuit breaking, observability — as a sidecar proxy

• Backend For Frontend (BFF): aggregates multiple P2P calls into one API per client type

Q65. What is traffic routing in microservices?

Traffic routing determines how requests are directed to service instances. It enables canary deployments,
A/B testing, blue-green deployments, and geographic routing.

Methods:
• API Gateway routing: Route by URL path (/api/v1 → v1 service, /api/v2 → v2 service)
• Header-based routing: X-Beta-User: true → route to new version
• Weighted routing: 90% to stable, 10% to canary (Istio VirtualService)
• Geographic routing: Indian users → Mumbai region; EU users → Frankfurt region

# Istio VirtualService — canary routing


apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: order-service
spec:
  hosts:
  - order-service                  # assumed: the host clients call (required field, absent in the original)
  http:
  - match:
    - headers:
        x-canary-user:
          exact: 'true'
    route:
    - destination:
        host: order-service-v2     # canary users
  - route:
    - destination:
        host: order-service-v1     # everyone else
      weight: 90
    - destination:
        host: order-service-v2
      weight: 10                   # 10% traffic to new version

Q66. How do you handle inter-service communication failure?

Multi-layered failure handling:

• Timeout: Never wait more than 3-5 seconds for a response


• Retry: For transient failures, retry with exponential backoff
• Circuit Breaker: Stop calling if failure rate is high
• Fallback: Return cached or default data instead of error
• Async alternatives: Convert sync call to async via Kafka if possible
• Saga compensation: For multi-step workflows, undo completed steps
• Dead Letter Queue: For Kafka failures, preserve failed messages for replay
Industry Use: The key question: 'Can the user workflow complete without this service?' If yes →
fallback/degrade. If no (payment must succeed) → fail fast with clear error message.
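
The retry layer with exponential backoff can be sketched in plain Java. This is a minimal illustration rather than a replacement for a library such as Resilience4j; the class Retry, its method names, and the doubling schedule with a cap are assumptions made for the example.

```java
import java.util.concurrent.Callable;

/** Hypothetical retry helper with capped exponential backoff. */
final class Retry {
    /** Delay before the next try: base, 2*base, 4*base, ... capped at maxMillis. */
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt - 1, 20);
        return Math.min(delay, maxMillis);
    }

    /** Calls the action up to maxAttempts times, sleeping between failed attempts. */
    static <T> T withBackoff(Callable<T> action, int maxAttempts,
                             long baseMillis, long maxMillis) throws Exception {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last; // all attempts failed: caller falls back or fails fast
    }
}
```

Retries only make sense for idempotent calls and transient errors; production code would usually also add random jitter so that many clients do not retry in lockstep.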

Q67. What is WebSocket? How does it differ from REST?

REST (HTTP)                                        | WebSocket

• Request-response model                           | • Full-duplex persistent connection
• Client initiates every call                      | • Server can push data anytime
• Each request is independent; connections may be  | • One TCP connection, kept alive
  reused (keep-alive, HTTP/2) but hold no session  |
• Stateless: no per-client connection state        | • Stateful: session per connection
• Use for: CRUD, most APIs                         | • Use for: real-time, bidirectional
• Overhead: HTTP headers on every call             | • Low overhead after handshake
• Example: GET /orders → JSON response             | • Example: Live sports scores, chat

REST vs WebSocket

// Spring WebSocket — STOMP over WebSocket


@Configuration
@EnableWebSocketMessageBroker
public class WebSocketConfig implements WebSocketMessageBrokerConfigurer {
@Override
public void configureMessageBroker(MessageBrokerRegistry registry) {
[Link]("/topic", "/queue");
[Link]("/app");
}
@Override
public void registerStompEndpoints(StompEndpointRegistry reg) {
[Link]("/ws").withSockJS();
}
}

// Push notification to all subscribers


@Autowired SimpMessagingTemplate messaging;
[Link]("/topic/orders", orderUpdate);

Q68. Design a 1-to-1 video call architecture. Design group video call architecture.

1-to-1 Video Call (WebRTC P2P):

Alice's browser ←→ Signaling Server (WebSocket) ←→ Bob's browser

1. Alice sends SDP offer to Signaling Server

2. Signaling Server relays to Bob

3. Bob sends SDP answer back

4. Exchange ICE candidates (via STUN to discover public IPs)

5. P2P connection established — video streams directly Alice↔Bob

(No media passes through server — minimal server cost)

Group Video Call (SFU — Selective Forwarding Unit):

Each participant → uploads ONE stream to SFU server

SFU server → selectively forwards streams to each participant

Each participant receives N-1 streams (from other participants)

Participant's client adapts based on bandwidth

Examples: [Link], Zoom, Google Meet use SFU architecture

Alternative: MCU (Mixes all streams server-side) — higher server CPU

Q69. What is a signaling server? What is a STUN/TURN server? How does WebRTC work?

WebRTC components:

Component        | Role

Signaling Server | Exchanges session metadata (SDP: codec, resolution) between peers. Not
                 | standardized; WebSocket is a common choice. No media passes through.

STUN Server      | Helps peers discover their public IP/port (behind NAT). Client asks STUN:
                 | 'What's my public IP?' Fast, cheap.

TURN Server      | Relay server for when direct P2P connection fails (strict NAT/firewall). All
                 | media relays through TURN. Expensive (bandwidth).

ICE Framework    | Gathers candidate addresses (local host, STUN-discovered, TURN relay), tests
                 | candidate pairs, and uses the best pair that works, preferring direct paths
                 | over relays.

WebRTC flow:

1. Alice creates RTCPeerConnection with ICE servers (STUN/TURN URLs)

2. Alice creates SDP offer (describes media capabilities)

3. Signaling Server sends offer to Bob

4. Bob creates SDP answer, sends via Signaling Server

5. Both gather ICE candidates (network addresses via STUN)

6. ICE candidates exchanged via Signaling Server

7. ICE finds best path → P2P connection established

8. Media streams directly between browsers

Q70. What is HLD vs LLD?

HLD (High-Level Design)                         | LLD (Low-Level Design)

• Macro architecture view                       | • Micro implementation view
• Services and their interactions               | • Class diagrams, API contracts
• Technology choices (Kafka, Redis, MySQL)      | • DB schema, entity relationships
• Data flow diagrams                            | • Algorithm and data structure choices
• Scalability and fault tolerance strategy      | • Design patterns used (Factory, Strategy)
• Typically discussed with senior engineers     | • Discussed with all engineers in team
• 'What' you are building and why               | • 'How' exactly each component works

HLD vs LLD

Q71. Explain your E-commerce system architecture (HLD).

Services:

Mobile/Web Client

■ API Gateway (Kong/AWS API GW) — auth, rate limiting, routing

User Service    | Product Service | Search (Elastic)

Order Service   | Cart Service    | Inventory Service

Payment Svc     | Shipping Svc    | Notification Svc

Review Svc      | Coupon Service  | Analytics Service

■ Message Bus (Kafka) — async communication between services

■ Databases: MySQL (transactional), Redis (cache), Elasticsearch (search)

■ Storage: S3 (images, files)

■ Monitoring: Prometheus + Grafana + ELK + Jaeger

■ Kubernetes on AWS EKS — orchestrates all services

Key decisions:
• CDN for product images: cached assets are served from a nearby edge, cutting latency (for example, from around 200ms to a few ms)
• Redis cluster for cart, sessions, inventory locks
• Kafka for order→inventory→shipping→notification chain
• Read replicas for product catalog (10:1 read:write ratio)

Q72. How do you design the Inventory Service?

Responsibilities:
• Track stock levels per product per warehouse
• Reserve/release stock atomically (prevent overselling)
• Notify when stock runs low

Key design:

-- DB Schema
CREATE TABLE inventory (
    product_id   BIGINT,
    warehouse_id BIGINT,
    available    INT,   -- can be ordered
    reserved     INT,   -- locked but not confirmed
    sold         INT,
    updated_at   TIMESTAMP,
    PRIMARY KEY (product_id, warehouse_id)  -- one row per product per warehouse
);

// Atomic reservation (prevents race condition)
@Transactional
public boolean reserve(Long productId, int quantity) {
    // Conditional update: succeeds only if available >= quantity
    int updated = [Link](productId, quantity);
    if (updated == 0) throw new InsufficientStockException();
    return true;
}

// Spring Data JPA native query: one atomic conditional UPDATE.
// There is no version column, so this is a guarded update rather than
// optimistic locking in the JPA sense. With one row per warehouse,
// also filter on warehouse_id.
@Modifying
@Query(value = "UPDATE inventory SET reserved = reserved + :qty, " +
               "available = available - :qty " +
               "WHERE product_id = :id AND available >= :qty",
       nativeQuery = true)
int reserve(@Param("id") Long id, @Param("qty") int qty);

Q73. How to design a Cart Service?

Cart is both a read-heavy and a write-heavy service with high mutability (users constantly add/remove
items). A Redis Hash is a good fit: O(1) field operations and a TTL for abandoned carts.

// Cart stored in Redis as Hash:
// Key: cart:{userId}
// Field: productId
// Value: {quantity, price, addedAt}

@Service
public class CartService {
private static final Duration CART_TTL = [Link](30);

public void addItem(Long userId, CartItem item) {


String key = "cart:" + userId;
String field = [Link]().toString();
[Link]().put(key, field, toJson(item));
[Link](key, CART_TTL); // reset TTL on activity
}

public Cart getCart(Long userId) {


Map<Object, Object> entries =
[Link]().entries("cart:" + userId);
return [Link](entries);
}

public void removeItem(Long userId, Long productId) {


[Link]().delete(
"cart:" + userId, [Link]());
}
}

Industry Use: Cart data lives in Redis for active users. On checkout, cart is read from Redis, validated
against inventory, and saved to DB for order creation. Abandoned carts (TTL expired) are eligible for
email recovery flows.

Q74. How to design a Notification Service with multi-channel support?


Design goals:
• Support Email, SMS, Push, WhatsApp from a single service
• Decouple from caller — receive events via Kafka
• Track delivery status per notification
• Template-based message generation
• Retry failed notifications

Any service → publish NotificationRequest to Kafka 'notifications' topic

{userId, type: EMAIL, template: 'order-confirmed', data: {orderId: 42}}

■ Notification Consumer reads from Kafka

■ Template Service: render FreeMarker template with data

■ Channel Router: EMAIL → EmailChannel, SMS → TwilioChannel

■ Provider sends message

■ Update notification_log table (status, sentAt, deliveredAt)

■ Webhook from provider: update delivery status

Q75. What design patterns are used in a Notification Service (Strategy, Factory)?

• Strategy Pattern: Each notification channel (Email, SMS, Push) is a strategy. The service delegates
to the right strategy at runtime based on notification type.
• Factory Pattern: NotificationChannelFactory creates the appropriate channel implementation based
on type.
• Template Method: Base send() method handles common steps (log, deduplicate); subclasses
implement channel-specific sending.
• Observer/Event-Driven: Notification service is a Kafka consumer — reacts to events from other
services.

// Factory + Strategy
@Component
public class NotificationChannelFactory {
@Autowired List<NotificationChannel> channels; // all @Component implementations

public NotificationChannel getChannel(ChannelType type) {


return [Link]()
.filter(c -> [Link](type))
.findFirst()
.orElseThrow(() -> new UnsupportedChannelException(type));
}
}

@Service
public class NotificationDispatcher {
@Autowired NotificationChannelFactory factory;

public void dispatch(NotificationRequest req) {


NotificationChannel channel = [Link]([Link]());
[Link](req); // Strategy: polymorphic dispatch
}
}

Q76. What is FreeMarker? How do you design email templates?

FreeMarker is a Java template engine that generates text output (HTML emails, SMS text) from templates
and data models. Templates are .ftl files with ${variable} substitution and FTL directives.

<!-- templates/[Link] -->
<html><body>
<h2>Hi ${[Link]}!</h2>
<p>Your order #${[Link]} has been confirmed.</p>
<table>
<#list [Link] as item>
<tr>
<td>${[Link]}</td>
<td>Qty: ${[Link]}</td>
<td>Rs ${[Link]}</td>
</tr>
</#list>
</table>
<p>Total: Rs ${[Link]}</p>
</body></html>

// Java — render template


@Service
public class TemplateService {
@Autowired Configuration freemarkerConfig;

public String render(String templateName, Map<String,Object> data)


throws Exception {
Template template = [Link](templateName);
StringWriter writer = new StringWriter();
[Link](data, writer);
return [Link]();
}
}

Q77. How to confirm that 1000 notifications were sent successfully?


• Notification log table: Every notification written to DB with status (QUEUED → SENT →
DELIVERED / FAILED)
• Provider delivery receipts: Email (SMTP bounce handling), SMS (Twilio delivery webhook), Push
(FCM delivery callback)
• Batch job metrics: After bulk send, query: SELECT COUNT(*), status FROM notifications WHERE
batch_id=X GROUP BY status
• Grafana dashboard: Realtime chart: notifications_sent_total vs notifications_failed_total
• Alerts: Alert if failure rate > 5% in a batch
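
The batch check and the 5% alert rule can be sketched as a small calculation over per-notification statuses. This is an illustration only: BatchDeliveryReport and its Status enum are hypothetical names mirroring the statuses listed above, and the statuses are assumed to have already been loaded from the notification log.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical batch report: counts per status and failure-rate alerting. */
final class BatchDeliveryReport {
    enum Status { QUEUED, SENT, DELIVERED, FAILED }

    /** Equivalent of: SELECT COUNT(*), status ... GROUP BY status. */
    static Map<Status, Integer> countByStatus(List<Status> statuses) {
        Map<Status, Integer> counts = new EnumMap<>(Status.class);
        for (Status s : Status.values()) {
            counts.put(s, 0);
        }
        for (Status s : statuses) {
            counts.put(s, counts.get(s) + 1);
        }
        return counts;
    }

    /** Fraction of notifications in the batch that ended in FAILED. */
    static double failureRate(Map<Status, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return total == 0 ? 0.0 : (double) counts.get(Status.FAILED) / total;
    }

    /** True when the failure rate exceeds the threshold (e.g. 0.05 for 5%). */
    static boolean shouldAlert(Map<Status, Integer> counts, double threshold) {
        return failureRate(counts) > threshold;
    }
}
```

A batch of 1000 is confirmed only when DELIVERED plus FAILED adds up to 1000; anything still QUEUED or SENT needs the reconciliation job.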

@Scheduled(fixedDelay=60000) // every minute
public void reconcileDelivery() {
// Find SENT notifications not confirmed delivered in 1 hour
List<Notification> unconfirmed = notificationRepo
.findBySentAtBeforeAndStatus([Link]().minusHours(1), "SENT");

[Link](n -> {
DeliveryStatus status = [Link]([Link]());
[Link]([Link]());
[Link](n);
});
}

Q78. How to design a payment system?

1. User initiates payment → Payment Service creates PENDING PaymentRecord

2. Payment Service calls Payment Gateway (Razorpay/Stripe) API

3. Gateway redirects user to payment page (3DS if needed)

4. User completes payment on gateway's page

5. Gateway sends webhook: {status: SUCCESS, transactionId, amount}

6. Payment Service verifies webhook signature (HMAC-SHA256)

7. Update PaymentRecord to SUCCESS, publish PaymentSuccess event

8. Saga: Order Service, Inventory Service, Notification react to event

Key design decisions:


• Idempotency: Use gateway's idempotency key — never double-charge
• Webhook verification: Validate gateway signature before processing
• Timeout reconciliation: If webhook doesn't arrive in 15 min, poll gateway
• PCI DSS: Never store raw card numbers — use gateway's token vault
• Audit log: Immutable log of every state transition
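
Webhook verification with HMAC-SHA256 can be sketched with the JDK's own crypto classes. This is a minimal illustration under assumptions: the class WebhookSignature is hypothetical, the signature is assumed to be the lowercase hex encoding of the HMAC over the raw request body, and the shared secret is assumed to come from the gateway's dashboard. Check the gateway's documentation for its exact scheme.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical HMAC-SHA256 webhook signature helper. */
final class WebhookSignature {
    /** Computes the lowercase hex HMAC-SHA256 of the payload. */
    static String sign(String payload, String secret) throws GeneralSecurityException {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
        byte[] raw = mac.doFinal(payload.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder(raw.length * 2);
        for (byte b : raw) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /** Compares the received signature with the expected one in constant time. */
    static boolean verify(String payload, String signatureHex, String secret)
            throws GeneralSecurityException {
        byte[] expected = sign(payload, secret).getBytes(StandardCharsets.UTF_8);
        byte[] actual = signatureHex.getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, actual);
    }
}
```

The signature must be computed over the raw body exactly as received; re-serialising the parsed JSON first would change the bytes and break verification.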

@PostMapping("/webhooks/payment")
public ResponseEntity<String> handleWebhook(
@RequestBody String payload,
@RequestHeader("X-Razorpay-Signature") String signature) {

// Verify webhook authenticity


if (![Link](payload, signature)) {
return [Link](401).body("Invalid signature");
}

PaymentWebhook webhook = [Link](payload, [Link]);


[Link](webhook);
return [Link]("Received");
}

Q79. How to handle a scheduler vs webhook race condition?

Race condition: A scheduled job polls the payment gateway for status at the same time the webhook
arrives. Both try to update the payment status simultaneously — can cause inconsistent state.

Solution: Optimistic locking + state machine transitions

@Entity
public class Payment {
@Version
private Integer version; // Optimistic locking

private String status; // PENDING → SUCCESS | FAILED


}

@Service
public class PaymentService {
@Transactional
public void updateStatus(Long paymentId, String newStatus) {
Payment payment = [Link](paymentId)
.orElseThrow();

// Only transition from PENDING — prevents overwriting SUCCESS


if (![Link]().equals("PENDING")) {
[Link]("Payment {} already processed: {}",
paymentId, [Link]());
return; // idempotent — no-op if already processed
}

[Link](newStatus);
[Link](payment);
// @Version increments — if concurrent update: OptimisticLockException
}
}
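// Scheduler and webhook may both try → first commit wins → the second either sees a
// non-PENDING status (no-op) or fails with an optimistic lock exception and can be ignored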
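
The "only transition from PENDING" rule can be shown in isolation with a compare-and-set. This is an in-memory analogue, not the JPA code: the class PaymentState is hypothetical, and an AtomicReference plays the role that the version check (or a conditional UPDATE ... WHERE status = 'PENDING') plays in the database.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical in-memory payment state with a one-way PENDING transition. */
final class PaymentState {
    enum Status { PENDING, SUCCESS, FAILED }

    private final AtomicReference<Status> status = new AtomicReference<>(Status.PENDING);

    /**
     * Moves PENDING to a terminal status. Returns true for the single caller
     * that performed the transition and false for everyone who arrives later.
     */
    boolean transitionFromPending(Status next) {
        if (next == Status.PENDING) {
            throw new IllegalArgumentException("target must be a terminal status");
        }
        return status.compareAndSet(Status.PENDING, next);
    }

    Status current() {
        return status.get();
    }
}
```

Whichever of the scheduler or the webhook gets there first sets the final status; the loser gets false and can log and return, so a late FAILED poll cannot overwrite a SUCCESS.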

Q80. What are the components in a Kafka-based microservices architecture?

Component       | Role

Producer        | Service that writes events to Kafka topics (e.g., Order Service)

Consumer        | Service that reads and processes events (e.g., Inventory Service)

Topic           | Named channel for events (e.g., order-events, payment-events)

Partition       | A topic is split into partitions for parallelism. Each partition is ordered.

Consumer Group  | Group of consumers sharing a topic; each partition is assigned to exactly one
                | consumer in the group. Enables parallel processing.

Offset          | Position of the last consumed message. Committed to track progress.

Broker          | A Kafka server. Multiple brokers form a Kafka cluster for HA.

ZooKeeper/KRaft | Kafka cluster coordination (KRaft replaces ZooKeeper: production-ready since
                | Kafka 3.3, ZooKeeper removed in Kafka 4.0)

Schema Registry | Stores Avro/Protobuf schemas; ensures producer/consumer compatibility

Kafka Connect   | Framework for source/sink connectors that move data between Kafka and external
                | systems, e.g. Debezium for CDC (streaming DB changes into Kafka)
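
How a message key ends up in a partition can be sketched in a few lines. This is a simplified stand-in: Kafka's default partitioner hashes the serialized key bytes with murmur2, whereas this hypothetical KeyPartitioner uses String.hashCode purely to illustrate the "same key, same partition" property.

```java
/** Hypothetical key-to-partition mapping (not Kafka's actual hash function). */
final class KeyPartitioner {
    /** Maps a key to a partition index in [0, numPartitions). */
    static int partitionFor(String key, int numPartitions) {
        if (numPartitions < 1) {
            throw new IllegalArgumentException("numPartitions must be >= 1");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

Because all events with the same key (for example an order ID) land in one partition, and a partition is consumed by one consumer per group, per-key ordering is preserved. Changing the partition count changes the mapping, which is why it is usually fixed up front.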

Q81. How do you ensure zero downtime deployments (blue-green / rolling)?

Rolling deployment (Kubernetes default):

Current: 3 pods running v1

■ K8s terminates 1 pod (v1) + starts 1 pod (v2)

■ Wait for v2 pod to pass readiness probe

■ Repeat until all pods are v2

■ At any time: mix of v1 + v2 pods serving traffic

Zero downtime: new pods receive traffic only after passing the readiness probe, and old pods are removed from the load balancer before they are terminated

Blue-Green deployment:

Blue (v1): production traffic (100%)

Green (v2): deploy + test on separate environment

■ Tests pass → switch load balancer from Blue to Green (instant cutover)

■ Blue stays running for 10 mins as fallback

■ If issues → switch back to Blue (instant rollback)

■ After confidence: decommission Blue

# Kubernetes rolling update config
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take pods below desired count
      maxSurge: 1         # allow 1 extra pod during update
  template:
    spec:
      containers:
        - name: order-service
          image: order-service:v2.0.1
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
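
As a rough illustration of what maxUnavailable: 0 with maxSurge: 1 implies, the sketch below simulates a surge-first rollout and records the number of ready pods after each step. RollingUpdateSketch is a hypothetical class; the real Deployment controller is more involved, and the simulation assumes every new pod eventually passes its readiness probe.

```java
import java.util.ArrayList;
import java.util.List;

/** Simulates a surge-first rolling update: start new pods, wait for ready, then retire old ones. */
final class RollingUpdateSketch {

    /** Returns the ready-pod count observed after every step of the rollout. */
    static List<Integer> readyCountsDuringRollout(int desired, int maxSurge) {
        if (desired < 1 || maxSurge < 1) {
            throw new IllegalArgumentException("desired and maxSurge must be >= 1");
        }
        int oldReady = desired;
        int newReady = 0;
        List<Integer> observed = new ArrayList<>();
        while (oldReady > 0) {
            int batch = Math.min(maxSurge, oldReady);
            newReady += batch;                 // new pods pass their readiness probe
            observed.add(oldReady + newReady);
            oldReady -= batch;                 // only now are the old pods terminated
            observed.add(oldReady + newReady);
        }
        return observed;
    }

    public static void main(String[] args) {
        System.out.println(readyCountsDuringRollout(3, 1));
    }
}
```

For 3 replicas the ready count alternates between 4 and 3 and never drops below the desired count, which is the zero-downtime property the config is aiming for.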

Q82. What is Canary Deployment vs Blue-Green Deployment?

Canary Deployment

• Route a small % of traffic to the new version (1-10%)
• Gradual rollout based on metrics
• If metrics OK → increase %; if bad → rollback
• Good for risky changes: real user validation
• Tools: Istio VirtualService, AWS CodeDeploy
• Cost: run multiple versions in parallel
• Risk: lower, affects only a small user subset

Blue-Green Deployment

• Two identical environments: Blue (prod) + Green (new)
• 100% traffic switch at once
• Instant rollback: switch back to Blue
• Good for schema migrations, breaking changes
• Tools: AWS ALB weighted TG, Kubernetes
• Cost: 2x infrastructure during switch
• Risk: all users affected if issue found late

Canary vs Blue-Green
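
A minimal sketch of the canary traffic split, assuming routing is decided per user ID. CanaryRouter and the version labels "v1"/"v2" are hypothetical; service meshes and load balancers usually split by weight per request instead. Hashing the user ID keeps each user on one version for the whole rollout.

```java
/** Routes a fixed percentage of users to the canary version, stable per user. */
final class CanaryRouter {

    private final int canaryPercent;

    CanaryRouter(int canaryPercent) {
        if (canaryPercent < 0 || canaryPercent > 100) {
            throw new IllegalArgumentException("canaryPercent must be between 0 and 100");
        }
        this.canaryPercent = canaryPercent;
    }

    /** Returns "v2" for users whose hash bucket falls inside the canary share, else "v1". */
    String versionFor(String userId) {
        int bucket = Math.floorMod(userId.hashCode(), 100); // bucket in 0..99
        return bucket < canaryPercent ? "v2" : "v1";
    }
}
```

Raising canaryPercent from 5 to 25 to 100 only ever moves users from v1 to v2, never back, so the rollout is gradual and a rollback is a matter of setting the percentage to 0.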

Q83. What is Observability in microservices?

Observability is the ability to understand the internal state of a system from its external outputs. The three
pillars: Metrics, Logs, Traces. Together they answer: 'What's wrong, where, and why?'

Pillar    Answers                                                                                      Tools

Metrics   Is the system healthy? What are the numbers? (request rate, error rate, latency)            Prometheus, Micrometer, Grafana

Logs      What happened? Detailed event record with context (userId, orderId, error stack trace)      ELK Stack, Splunk, Loki

Traces    How did this request flow through the system? Which service caused the latency?             Jaeger, Zipkin, AWS X-Ray

Golden Signals (Google SRE): Latency, Traffic, Errors, Saturation — the minimum you must
monitor.
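
For intuition, two of the golden signals can be computed directly from raw request samples. The sketch below is illustrative only: GoldenSignals is a hypothetical class, the percentile uses the simple nearest-rank method, and real monitoring systems aggregate these values from counters and histograms rather than from full sample lists.

```java
import java.util.Arrays;

/** Computes error rate and latency percentiles from raw request samples. */
final class GoldenSignals {

    /** Errors: fraction of responses with a 5xx status code. */
    static double errorRate(int[] statuses) {
        if (statuses.length == 0) {
            return 0.0;
        }
        long errors = Arrays.stream(statuses).filter(s -> s >= 500 && s <= 599).count();
        return (double) errors / statuses.length;
    }

    /** Latency: nearest-rank percentile (p in 1..100) of the observed latencies. */
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

Traffic is simply the number of samples per time window; saturation (CPU, memory, queue depth) has to come from resource metrics and cannot be derived from request samples alone.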

Q84. What is the ELK Stack? How does Centralized logging work?

Microservice writes log to STDOUT (or file)

■ Log shipper (Filebeat/Fluentd) reads logs and sends to Logstash

■ Logstash: parses, enriches, filters log lines → structured JSON

Example: extract traceId, userId, severity from log string

■ Elasticsearch: stores and indexes structured log data

■ Kibana: query, visualise, alert on logs

Dashboard: 'All ERROR logs in last 1h for order-service'
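
The Logstash parsing step can be approximated in a few lines of Java. This is a sketch under assumptions: the plain-text layout "timestamp LEVEL [traceId=...] message" is hypothetical, LogLineParser is an illustrative name, and a real pipeline would use Logstash filters (such as grok) rather than application code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Turns one plain-text log line into structured fields, as a log pipeline's parse step would. */
final class LogLineParser {

    // Hypothetical layout: "<timestamp> <LEVEL> [traceId=<id>] <message>"
    private static final Pattern LINE =
            Pattern.compile("^(\\S+)\\s+(\\w+)\\s+\\[traceId=([^\\]]+)\\]\\s+(.*)$");

    static Map<String, String> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognised log line: " + line);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("timestamp", m.group(1));
        fields.put("level", m.group(2));
        fields.put("traceId", m.group(3));
        fields.put("message", m.group(4));
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("2024-01-15T10:00:00Z ERROR [traceId=abc-123] Payment failed for order 42"));
    }
}
```

Once each line is a set of named fields, Elasticsearch can index level and traceId separately, which is what makes queries like "all ERROR logs for order-service" cheap.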

# Logback config: structured JSON logs
# (assumes the logstash-logback-encoder library is on the classpath)

<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
      <!-- Outputs JSON: timestamp, level, message, traceId, spanId -->
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT" />
  </root>
</configuration>

# Output: {
# '@timestamp': '2024-01-15T[Link]',
# 'level': 'ERROR',
# 'message': 'Payment failed for order 42',
# 'traceId': 'abc-123',
# 'service': 'order-service'
# }

Q85. How do logs get added to Splunk?

Application writes logs to STDOUT / log file

■ Splunk Universal Forwarder (agent) installed on each host/container

■ Forwarder monitors log file / Docker stdout in real-time



■ Forwarder sends raw log lines to Splunk Indexer (via TCP)

■ Splunk Indexer: parses, tokenizes, and indexes log data

■ Splunk Search Head: provides UI for searching, dashboards, alerts

In Kubernetes: Splunk Connect for Kubernetes (DaemonSet) collects all pod logs (Splunk now recommends the Splunk OpenTelemetry Collector for Kubernetes as its successor)

Splunk query example (SPL):

index=production service=order-service level=ERROR
| timechart count by exception_class

index=production traceId="abc-123"
| sort _time | table _time service level message

Q86. How does distributed logging work using Correlation IDs?

A Correlation ID (traceId) is a unique ID assigned at the entry point (API Gateway) and propagated
through all service calls via HTTP headers and Kafka message headers. Every log line includes this ID —
enabling you to find all logs for one user request across all services.

API Gateway: assigns traceId='abc-123', adds to X-Trace-Id header

■ Order Service receives request, reads traceId from header

Adds to MDC (Mapped Diagnostic Context) → all log lines include traceId

Passes traceId to downstream: X-Trace-Id: abc-123

■ Payment Service receives traceId, adds to MDC, logs all steps

■ Kafka producer adds traceId to message headers

■ Notification Service reads traceId from Kafka headers, adds to MDC

Kibana query: traceId:'abc-123'

→ Shows ALL log lines from ALL services for this one request
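
The receiving side of the flow above can be sketched without any framework. In this illustration a ThreadLocal stands in for SLF4J's MDC, and TraceContext with its methods is a hypothetical helper, not a library API. Header lookup is case-sensitive here for brevity, whereas real HTTP header names are case-insensitive.

```java
import java.util.Map;
import java.util.UUID;

/** Per-request trace context; a ThreadLocal plays the role of the logging MDC. */
final class TraceContext {

    static final String HEADER = "X-Trace-Id";

    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    /** Reuses the caller's traceId if present, otherwise starts a new trace (entry point). */
    static String begin(Map<String, String> inboundHeaders) {
        String traceId = inboundHeaders.get(HEADER);
        if (traceId == null || traceId.isBlank()) {
            traceId = UUID.randomUUID().toString();
        }
        CURRENT.set(traceId);
        return traceId;
    }

    /** Prefixes a log message with the current traceId, as an MDC-aware log pattern would. */
    static String format(String message) {
        return "traceId=" + CURRENT.get() + " " + message;
    }

    /** Headers to attach to downstream calls; call begin() first. */
    static Map<String, String> outboundHeaders() {
        return Map.of(HEADER, CURRENT.get());
    }

    /** Clears the context so a pooled thread does not leak the ID into the next request. */
    static void end() {
        CURRENT.remove();
    }
}
```

A request filter would call begin() on entry and end() in a finally block; forgetting end() on pooled threads is a common source of log lines tagged with the wrong traceId.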

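// Feign client interceptor: propagate traceId

import feign.RequestInterceptor;
import feign.RequestTemplate;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;

@Component
public class TraceIdInterceptor implements RequestInterceptor {
    @Override
    public void apply(RequestTemplate template) {
        // Copy the current request's traceId from the logging context to the outgoing call
        String traceId = MDC.get("traceId");
        if (traceId != null) {
            template.header("X-Trace-Id", traceId);
        }
    }
}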



Industry Use: Spring Cloud Sleuth handles this automatically: it instruments Feign, RestTemplate, Kafka, and
@Async, propagating the traceId everywhere without manual code. On Spring Boot 3, Micrometer Tracing replaces Sleuth.

Q87. What is the difference between Grafana and Prometheus?

Prometheus

• Time-series database
• Collects (scrapes) metrics from apps
• Stores metrics with labels
• Powerful query language: PromQL
• Alerting rules (Alertmanager)
• Pull model: scrapes /metrics endpoint
• Does NOT visualize: only stores + queries

Grafana

• Visualisation and dashboard platform
• Reads from Prometheus, Elasticsearch, etc.
• Creates graphs, charts, heatmaps
• Does NOT collect metrics itself
• Alert notifications (Slack, PagerDuty)
• Shows real-time and historical data
• Industry-standard dashboard tool

Prometheus stores; Grafana visualises

# PromQL — query P99 latency for order-service


histogram_quantile(0.99,
sum(rate(http_server_requests_seconds_bucket{
application='order-service',
uri='/api/v1/orders',
status='200'
}[5m])) by (le)
)
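
To show roughly what histogram_quantile does with the bucket counts, here is a simplified Java sketch of the linear interpolation inside the matching bucket. HistogramQuantile is a hypothetical class; it assumes sorted upper bounds with cumulative counts, a lowest bucket starting at 0, and a final +Inf bucket, and it omits edge cases that Prometheus handles.

```java
/** Simplified quantile estimate from cumulative histogram buckets (linear interpolation). */
final class HistogramQuantile {

    /**
     * @param q                quantile in (0, 1], e.g. 0.99 for P99
     * @param upperBounds      sorted bucket upper bounds ("le" labels), last one usually +Inf
     * @param cumulativeCounts observations with value <= the matching upper bound
     */
    static double quantile(double q, double[] upperBounds, double[] cumulativeCounts) {
        if (q <= 0 || q > 1 || upperBounds.length == 0
                || upperBounds.length != cumulativeCounts.length) {
            throw new IllegalArgumentException("invalid quantile or buckets");
        }
        int n = upperBounds.length;
        double total = cumulativeCounts[n - 1];
        if (total == 0) {
            return Double.NaN; // no observations
        }
        double rank = q * total;
        int i = 0;
        while (i < n - 1 && cumulativeCounts[i] < rank) {
            i++; // find the first bucket that contains the rank
        }
        if (Double.isInfinite(upperBounds[i])) {
            // Cannot interpolate into +Inf: fall back to the highest finite bound
            return i == 0 ? Double.NaN : upperBounds[i - 1];
        }
        double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
        double below = (i == 0) ? 0.0 : cumulativeCounts[i - 1];
        return lower + (upperBounds[i] - lower) * (rank - below) / (cumulativeCounts[i] - below);
    }
}
```

With buckets 0.1s, 0.5s, 1.0s, +Inf holding 50, 90, 100, 100 cumulative requests, the P99 lands in the 0.5s to 1.0s bucket and is estimated at about 0.95s. The estimate is only as precise as the bucket boundaries, which is why bucket layout matters for latency SLOs.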

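# Alert rule in Prometheus

groups:
  - name: order-service
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses over the last 5 minutes
        expr: |
          sum(rate(http_server_requests_seconds_count{application="order-service", status=~"5.."}[5m]))
            / sum(rate(http_server_requests_seconds_count{application="order-service"}[5m])) > 0.05
        for: 2m
        annotations:
          summary: 'Error rate > 5% on order-service'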
