Problem Statements

The document outlines several summer project problem statements focused on digital lending, customer value retention, airline loyalty programs, and delivery ETA optimization. Each project includes industry context, business challenges, specific problem statements, key questions to address, data design guidelines, and expected deliverables. Participants are expected to leverage data-driven approaches to develop actionable insights and recommendations for the respective industries.

Uploaded by

Gauransh Gupta

Summer Projects ‘26

Problem Statements

Digital Lending: Portfolio Optimization
Decoding Customer Value: A SQL-Driven Retention Strategy
Unlocking Behavioral Intelligence in Airline Loyalty Programs
Optimizing Delivery ETAs with Graph-Based Network Intelligence
consulting

Digital Lending: Portfolio Optimization
Industry Context
Digital lending institutions in emerging markets have witnessed rapid growth by leveraging technology-
driven customer acquisition, alternative data sources, and faster credit decisioning. These lenders cater
to a diverse customer base across urban and semi-urban regions, including first-time borrowers and
underbanked segments, through products such as unsecured personal loans, SME working capital loans,
and buy-now-pay-later (BNPL) offerings.
While scale and speed have enabled strong topline growth, evolving borrower behavior, macroeconomic
volatility, and rising acquisition costs have increased pressure on portfolio quality and profitability. Many
lenders now face the challenge of balancing growth ambitions with prudent risk management, while
effectively leveraging the rich data generated across the customer lifecycle.

Business Challenge
Despite having access to extensive customer, loan, repayment, and behavioral data, lenders often
struggle to translate this information into actionable insights for strategic decision-making. Key issues
include limited visibility into risk drivers, delayed identification of potential delinquencies, suboptimal
pricing and product structures, and fragmented performance monitoring for leadership.
The central challenge is to design a data-driven approach that enables lenders to understand customer
risk and value holistically, detect early signs of financial stress, optimize growth strategies without
compromising portfolio quality, and support leadership with clear, forward-looking insights.

Problem Statement
How can a digital lending institution leverage end-to-end customer and transaction data to strengthen
credit risk assessment, enable early detection of delinquencies, optimize pricing and product strategies,
and drive sustainable, risk-adjusted portfolio growth?

Key Questions to Address


Participants are expected to explore and answer the following:
1. Which customer segments exhibit materially different risk and repayment behaviors, and what
attributes define these segments?
2. How do acquisition channels and onboarding strategies impact portfolio quality, customer lifetime
value, and unit economics?
3. Which loan products, ticket sizes, and tenures deliver the strongest balance between growth and risk?
4. How can pricing, approval, or tenure strategies be tailored across segments to improve overall
portfolio outcomes?
5. What performance metrics and views should senior leadership monitor to proactively manage risk and
growth?

Data Design Guidelines


Participants are required to design and generate their own synthetic dataset that reflects a realistic
digital lending ecosystem. The dataset should capture the full customer journey and allow meaningful
analysis of risk, growth, and performance across the following areas:

Data Area | Core Information to Include | Expected Relationship
Customer Profile | Geography, income proxy, employment type, credit quality indicator | Weaker profiles tend to exhibit higher risk
Loan & Product | Product type, loan size, tenure, pricing, origination risk grade | Higher risk may result in tighter terms or higher pricing
Repayment Behavior | Payment timeliness, delinquency status, partial payments | Deteriorating payment behavior precedes default
Behavioral Signals | Cash flow consistency, balance volatility, spending shocks | Higher volatility may result in higher delinquency likelihood
Acquisition | Channel, cost of acquisition, approval turnaround | Faster/cheaper growth may carry higher risk
Time Dimension | Origination and repayment periods | Newer cohorts may behave differently
Outcomes | Loan status, default indicator, value proxy | Certain segments outperform on risk-adjusted basis
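
One way to encode the expected relationships above when generating synthetic data is sketched below. This is only an illustration: the field names, the value ranges, and the coefficients in `default_probability` are all assumptions chosen for readability, not calibrated figures. The idea is to state each relationship as an explicit function first and then sample outcomes from it, so the dataset's structure can be defended.

```python
import random


def default_probability(credit_score: int, balance_volatility: float) -> float:
    """Assumed relationship: weaker credit and higher volatility raise default risk."""
    base = 0.02
    credit_penalty = (850 - credit_score) / 550 * 0.20  # 0.0 at score 850, 0.20 at score 300
    volatility_penalty = balance_volatility * 0.10      # volatility is scaled to [0, 1]
    return min(0.95, base + credit_penalty + volatility_penalty)


def generate_loans(n: int, seed: int = 42) -> list[dict]:
    """Generate n synthetic loan records with a reproducible seed."""
    rng = random.Random(seed)
    loans = []
    for loan_id in range(n):
        credit_score = rng.randint(300, 850)
        volatility = rng.random()
        p_default = default_probability(credit_score, volatility)
        loans.append({
            "loan_id": loan_id,
            "channel": rng.choice(["organic", "partner", "paid_ads"]),
            "product": rng.choice(["personal", "sme_working_capital", "bnpl"]),
            "credit_score": credit_score,
            "balance_volatility": round(volatility, 3),
            # Risk-based pricing (assumed): riskier borrowers are charged a higher rate.
            "interest_rate": round(0.12 + p_default * 0.5, 4),
            # Outcome is sampled from the stated probability, not assigned by hand.
            "defaulted": rng.random() < p_default,
        })
    return loans


loans = generate_loans(1000)
print(len(loans), "synthetic loans generated")
```

The same pattern extends to the other data areas, for example by letting the acquisition channel shift the credit score distribution.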

Deliverables
The final submission consists of two components: a Credit Policy Recommendation Report and a
Presentation deck.
1. The report should be 4 to 6 pages in length, addressed to a fictional Chief Risk Officer. It must
include an executive summary, risk segmentation findings, early warning signal logic, and a specific
policy recommendation accompanied by a projected impact assessment. This format mirrors what a
strategy analyst at a fintech or a consulting associate on a financial services engagement would be
expected to produce.
2. The presentation deck should complement the report and communicate the team's solution to a
stakeholder audience. Its structure and design are intentionally left to the participants' discretion; we
are more interested in the quality of thinking than in adherence to a prescribed format.
Across both components, the standard being applied is this: the work should lead with strategic
recommendations that are substantiated by data, not data outputs in search of a conclusion.

Point of Contact
Sanchit Pendharkar - 6363679144
Prakhar Gupta - 9235527628
SQL | consulting

Decoding Customer Value: A SQL-Driven Retention Strategy
Background
A direct-to-consumer (D2C) fashion brand sells clothing, accessories, footwear, and outerwear across the
United States. The brand has no physical stores and no third-party retailers — every customer relationship
is managed directly by the brand itself.
The brand has grown steadily and now has customer behavioral data covering around 3,900 customers.
It runs a promotional discount program, supports multiple payment methods, and offers a range of
shipping options. But it has never built a structured way to understand its customers beyond surface-level
sales numbers.
The founding team is at a critical point: they can keep running promotions reactively and hope customers
keep coming back — or they can build something more deliberate. They have chosen to build.

Core Problem
The brand has data but no intelligence built on top of it. Specifically, it cannot answer:
• Who are the customers likely to still be buying two years from now, and what do they look like today?
• Is the discount and promo program actually building a loyal customer base, or just attracting one-time
bargain hunters?
• Which product categories are associated with lower purchase history, and which ones appear most
among high-frequency, high-tenure customers?
• Are there cities or regions where the brand has strong traction that it has not yet deliberately targeted,
and does that traction look different in terms of category preference, promo sensitivity, or average spend
per customer?
• What does the brand's best customer actually look like in terms of age, purchase habits, payment
preferences, and satisfaction?
Without answers to these questions, the brand is making marketing and product decisions based on gut
feel rather than evidence.

Problem Statement
Using only transactional and behavioral data, can the brand identify what its most valuable customers
look like, measure how much of its current revenue depends on promotions, and build a data-backed
retention strategy that reduces discount dependency without hurting sales?

Key Questions to Address


1. Who are the genuinely loyal customers vs. those who only buy when there is a discount?
2. What behavioral patterns today predict high customer value over time?
3. Which geographies and demographics are commercially underleveraged?
4. How should the brand restructure its promotional strategy to protect margins without losing volume?
5. What does the brand's ideal customer profile look like, and how can it acquire more of them?

Scope of Analysis
1) Data Preparation & Feature Engineering (Python):
Clean and prepare the raw dataset. Then build the customer-level metrics the brand needs — but here
is the constraint: you are not told which metrics to build. You need to decide what to measure and
justify why those choices make sense for this business.
As a starting point, think about what signals in the data could indicate value, satisfaction, and
promotional reliance. But do not stop at computing numbers; explain the logic behind each metric you
create and what it is actually trying to capture.
One thing to keep in mind: metrics that sound analytical but do not lead to a decision are not useful.
Every engineered feature should answer a question the brand actually cares about.

2) Customer Segmentation & Analysis (SQL)


Build a structured query layer to answer the brand's core business questions:
• What separates high-value customers from low-value ones, and which profiles show the strongest
repeat purchase behavior?
• Which seasons and categories are associated with lower-tenure customers versus those with high
previous purchase counts?
• Which geographies signal organic demand versus discount-driven volume?
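
As a rough illustration of the third question, the sketch below runs a SQL aggregation through Python's built-in `sqlite3` module so that it is self-contained. The `customers` table, its column names, and the five sample rows are hypothetical stand-ins for the real dataset. Regions with high average spend and a low promo share are the ones that would suggest organic demand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        region TEXT,
        purchase_amount REAL,
        previous_purchases INTEGER,
        promo_used INTEGER  -- 1 if the recorded purchase used a promo code
    )
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (1, "West", 80.0, 40, 0),
        (2, "West", 60.0, 35, 0),
        (3, "South", 30.0, 3, 1),
        (4, "South", 50.0, 5, 1),
        (5, "South", 70.0, 20, 0),
    ],
)

# Per region: customer count, average spend, and share of promo-driven purchases.
query = """
    SELECT region,
           COUNT(*) AS customers,
           ROUND(AVG(purchase_amount), 2) AS avg_spend,
           ROUND(AVG(promo_used), 2) AS promo_share
    FROM customers
    GROUP BY region
    ORDER BY promo_share ASC, avg_spend DESC
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Whatever thresholds separate "organic" from "discount-driven" regions should be stated and justified in the analysis rather than read off a query like this one.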

3) Founder Dashboard (Power BI)


Build a clean, four-panel dashboard designed for a non-technical founding team:
• Customer pyramid: showing how value is distributed across the customer base
• Promo dependency vs. retention rate: plotted by segment to show who needs discounts to buy and
who doesn't
• Geographic opportunity map: not just which regions buy most, but which regions show high spend and
low promo dependency, indicating genuine brand pull rather than discount-driven demand
• Category funnel: showing which product categories are associated with low purchase history versus
high purchase history, as a proxy for entry-point versus retention categories

4) Retention Playbook (Business Recommendations)


Translate your findings into two outputs the brand can act on. Every recommendation must state the
trade-off clearly — not just what to do, but what you risk by doing it.
• Promotional sunset plan: Identify which segments to gradually stop discounting, why those segments
specifically, and what the margin impact looks like. The recommendation must name the segment, the
trigger behavior, the rollout timeline, and the metric you would track to know if it is working. "Reduce
discounts" is not an acceptable answer.
• Ideal customer profile: A data-backed description of the brand's most valuable customer type, specific
enough that a marketing team could use it to make targeting decisions today.

Dataset
Link

The Central Analytical Challenge


The Core Constraint
The dataset has no loyalty score, no churn label, and no timestamps. Every concept you use must be
constructed from available variables, not assumed.
Loyalty must be defined, not declared. You are required to build at least two competing definitions using
different variable combinations, test both, and argue clearly for one. Your justification must be grounded
in internal consistency, correlation with revenue, or predictive logic — not intuition.
Every segment claim must be traceable. Labels like "at-risk" or "high-value" are only valid if they map to a
specific, stated combination of variables. Assertions that cannot be traced back to the data will not meet
the standard of this project.
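
To make the "two competing definitions" requirement concrete, here is a minimal sketch. The field names, the sample rows, and the threshold of 25 previous purchases are all assumptions for illustration. It builds two loyalty flags from different variable combinations and compares how strongly each correlates with spend. Correlation with revenue is only one of the justification routes named above.

```python
from math import sqrt

# Hypothetical customer rows; the field names are assumptions, not the dataset's schema.
customers = [
    {"previous_purchases": 45, "promo_used": False, "subscribed": True,  "spend": 90.0},
    {"previous_purchases": 30, "promo_used": True,  "subscribed": False, "spend": 40.0},
    {"previous_purchases": 5,  "promo_used": True,  "subscribed": False, "spend": 35.0},
    {"previous_purchases": 12, "promo_used": False, "subscribed": True,  "spend": 85.0},
    {"previous_purchases": 40, "promo_used": False, "subscribed": True,  "spend": 95.0},
    {"previous_purchases": 3,  "promo_used": True,  "subscribed": False, "spend": 30.0},
]


def loyal_by_history(c: dict) -> bool:
    """Definition A (assumed threshold): loyalty means a long purchase history."""
    return c["previous_purchases"] >= 25


def loyal_by_full_price(c: dict) -> bool:
    """Definition B: loyalty means subscribing and buying without a promo."""
    return c["subscribed"] and not c["promo_used"]


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation; returns 0.0 if either series has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


spend = [c["spend"] for c in customers]
history_flags = [int(loyal_by_history(c)) for c in customers]
full_price_flags = [int(loyal_by_full_price(c)) for c in customers]

print("Definition A vs spend:", round(pearson(history_flags, spend), 3))
print("Definition B vs spend:", round(pearson(full_price_flags, spend), 3))
```

Because each flag maps to a stated combination of variables, any segment built on it stays traceable to the data.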

Expected Outcome
A comprehensive customer intelligence report and interactive dashboard designed to provide the brand
with a clear and actionable answer to a critical strategic question:
“Is the business successfully building a loyal customer base, or is it reliant on continuous promotional
activity? Furthermore, what strategic actions should be taken under either scenario?”

Deliverables
Python | Cleaned dataset + engineered features (dependency score, value tier, satisfaction flag)
SQL | Segmentation queries answering all 5 key questions above
Power BI | Four-panel founder dashboard as described in scope
Playbook | Promo sunset plan + ideal customer profile
Summary | A concise executive summary (max 1 page) of findings and recommendations

Point of Contact
Achyuth - 6381774762
Dhairya Nisar - 8928149400

Resources
Link
strategy | data analytics

Unlocking Behavioral Intelligence in Airline Loyalty Programs
Background
Airlines use loyalty programs to increase repeat business and improve Customer Lifetime Value (CLV).
However, many programs still focus mainly on points and rewards. As a result, many members stay
inactive, high-value customers may stop flying without being noticed, and marketing teams often react
too late. Airlines now need a more proactive and data-driven loyalty strategy.

The Challenge
You are part of the Business Analytics team of a mid-sized airline. You will work with data for around
16,700 Canadian loyalty members from 2012 to 2018.
Your goal is to build a solution that helps the marketing team identify customers likely to disengage,
understand which members are most valuable, and recommend actions to improve retention. The main
challenge is not only building a model, but also turning insights into clear business actions.

Core Objectives
1) Churn Prediction
The first objective is to identify members who are likely to stop engaging with the airline in the coming
period. The dataset contains formal cancellation records, but cancellation is not the only form of churn
that matters. Some customers may stop flying for a long period, while others may remain enrolled in the
loyalty program but stop earning or redeeming points.
You are expected to define what churn means in your analysis and explain the reasoning behind your
definition. There is no single correct answer, but your approach should be logical, realistic, and easy to
defend.
An important challenge in this task is avoiding the use of future information. Your model should only use
data that would have been available at the time of prediction. How you define churn, create features, and
avoid data leakage will be an important part of the evaluation.
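
One possible way to keep future information out of the features is sketched below. The row layout `(year, month, flights)`, the six-month windows, and the "no flights in the label window" churn rule are all assumptions; the definition of churn remains yours to choose and defend. Features are computed strictly from months before a cutoff, and the label strictly from months at or after it.

```python
def month_index(year: int, month: int) -> int:
    """Map a (year, month) pair to a single increasing integer."""
    return year * 12 + (month - 1)


def build_example(rows, cutoff, feature_window=6, label_window=6):
    """Return (features, churned) for one member at a prediction cutoff.

    rows   : list of (year, month, flights) for a single member
    cutoff : (year, month) at whose start the prediction is made
    """
    cut = month_index(*cutoff)
    # Features: only months strictly before the cutoff.
    past = [f for (y, m, f) in rows
            if cut - feature_window <= month_index(y, m) < cut]
    # Label: only months at or after the cutoff.
    future = [f for (y, m, f) in rows
              if cut <= month_index(y, m) < cut + label_window]
    features = {
        "flights_in_window": sum(past),
        "active_months_in_window": sum(1 for f in past if f > 0),
    }
    # Assumed churn rule: no flights at all during the label window.
    # Note that a member with no rows in the label window also counts as churned.
    churned = sum(future) == 0
    return features, churned


# Hypothetical member: active in the first half of 2017, silent afterwards.
flights_per_month = [2, 1, 0, 3, 1, 2, 0, 0, 0, 0, 0, 0]
member_rows = [(2017, month, flights)
               for month, flights in zip(range(1, 13), flights_per_month)]

features, churned = build_example(member_rows, cutoff=(2017, 7))
print(features, churned)
```

Sliding the cutoff across several months yields multiple training examples per member while keeping the same separation between features and label.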

2) Customer Segmentation & Value


The business already tracks CLV. Your job is to interrogate whether that metric is telling the full story.
What makes a customer genuinely valuable to an airline, not just historically, but going forward? How do
behavioral signals (flight frequency, redemption patterns, seasonal activity) interact with demographic
ones (salary band, education, marital status, card tier)? Surface the patterns that a marketing manager
looking at a summary report would miss.
Your segmentation framework will be evaluated on whether it produces segments that are meaningfully
distinct and actionable, and not just statistically separable.

3) Smart Retention
For each segment you identify, propose specific interventions. Vague recommendations ("offer bonus
points to at-risk members") are not appreciated. A strong recommendation is one that a non-technical
manager could hand to an operations team tomorrow; it would specify who receives it, when, why, and in
what form.
How you structure the logic connecting churn risk to retention action is an open design problem. Elegant
solutions will be recognized.

The Dataset - Link


Four integrated files:
• Customer Loyalty History: Demographics, loyalty tier, CLV, and cancellation records
• Customer Flight Activity: Monthly time-series of flights booked, distance traveled, and points
accumulated/redeemed
• Calendar: Date dimension mapping months to quarters and seasons
• Data Dictionary: Full column definitions
The data is real-world in character: it contains anomalies, inconsistencies, and edge cases. Part of your
analytical work is identifying these, documenting the decisions you make to address them, and ensuring
those decisions do not compromise your model.

Deliverables
1) Working Prototype
A functional interface (any tool: Streamlit, Power BI, Tableau, Excel, or other) that a non-technical
marketing manager can open and use without guidance. It must close the gap between your model's
output and a concrete business action. What that interface looks like, and what it prioritizes, is your
design call.
Standard: A first-time user should be able to identify who needs attention and what to do about it
without reading a manual.
2) Technical Report (6–8 pages)
Cover: how you framed the problem, what cleaning decisions you made and why, the logic behind your
model and segmentation choices, the most interesting things you found, and three business
recommendations that a CFO and a CMO could both get behind. The report is written for a smart non-
technical reader, not a peer reviewer.

Point of Contact
Nalin Goel - 8289031644
Keerthana - 9019647256
machine learning | consulting

Optimizing Delivery ETAs with Graph-Based Network Intelligence
Background
Delhivery is India's largest fully-integrated logistics provider, operating a vast network of facilities, inter-
city routes, and last-mile delivery across every major state. At the core of its operations is a hub-and-
spoke model: shipments travel from a source facility through one or more intermediate hubs before
reaching the destination; each hop is a segment of a multi-leg journey.
To estimate delivery times, Delhivery uses OSRM, a standard routing engine that assumes clean traffic and
shortest paths. But real-world logistics is far messier: congestion, facility dwell time, seasonal volume
spikes, and route-type constraints all cause actual delivery times to deviate significantly from predictions.

The Challenge
The strategic question: Delhivery's OSRM system underestimates actual delivery time on a significant
fraction of routes. Can a graph-based model, one that treats the logistics network as a connected graph
of facilities and corridors, not a collection of independent point-to-point estimates, produce more
accurate ETAs and identify which corridors and hubs are systematically causing delays?
• When ETA is wrong, SLAs are missed and customers are unhappy.
• Downstream capacity planning breaks down across the network.
• There is currently no systematic way to identify which hubs and corridors are the biggest contributors to
delays.
• Route-type decisions (FTL vs Carting) are made without accounting for graph position or structural risk of a
facility.

Your Mission
As a data science team, your goal is to build a graph-based intelligence system for Delhivery's logistics
network. You will model the entire network as a directed graph, with facilities as nodes and corridors as
edges, and use this structure to produce smarter ETA predictions, surface bottleneck hubs, and generate
actionable recommendations for the Head of Network Operations.
Your final output is not just a model; it is a consulting deliverable: a strategy memo that a real operations
leader could act on.

What we expect from you


• Technical rigor: Clean, reproducible code with clear documentation. Model choices should be
justified, not just applied.
• Analytical thinking: Go beyond running the model. Interpret what the graph metrics mean for real
operations. Why is a particular hub a bottleneck? What would fixing it cost vs. save?
• Business communication: The strategy memo must be written for an operations leader, not a data
scientist. No raw model outputs; translate findings into decisions.
• Creativity: Suggest additional insights, visualizations, or analyses beyond the provided scope if you
spot something interesting in the data.

Key Tasks and Deliverables


1. Graph construction & data pipeline - Parse and merge raw trip segments into a directed weighted
graph where edge weights capture the median actual-vs-OSRM delay ratio per corridor, stratified by
route type and time of day. Code must be clean, reproducible, and model choices justified - not just
applied.
2. Bottleneck & corridor audit - Compute betweenness centrality, in/out-degree, and clustering
coefficients to identify critical chokepoint hubs and chronically delayed corridors (actual time
exceeds OSRM by >20%). Visualize and rank by SLA breach contribution.
3. Graph-enhanced ETA prediction model - Build and benchmark a baseline regression (trip-level
features) against a GraphSAGE/node2vec-enhanced model. The graph model must demonstrably
outperform the baseline on MAE and on % of trips with predicted ETA within 15% of actual. The "graph
advantage" must be measured, not claimed.
4. FTL vs Carting decision framework - Build an ML-backed framework for route-type selection with the
time-cost trade-off quantified for different corridor profiles, accounting for distance, time of day,
and the source facility's graph position.
5. Network Operations Strategy Memo - Written for an operations leader, not a data scientist. Name the
top 5 bottleneck hubs with their estimated SLA breach contribution, recommend corridor-specific
interventions (parallel route, facility upgrade, or route-type shift), and quantify the % reduction in late
deliveries and revenue-at-risk recovered if the top 3 hubs are upgraded. No raw model outputs -
translate findings into decisions.
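
As a starting point for tasks 1 and 2, the sketch below builds corridor-level edge weights from trip segments using only the Python standard library. The segment tuples, facility names, and timings are hypothetical, and stratification is shown by route type only; time of day would be a further key component. Centrality measures such as betweenness are omitted here and would in practice come from a graph library (for example NetworkX's `betweenness_centrality`).

```python
from collections import defaultdict
from statistics import median

# Hypothetical trip segments: (source, destination, route_type, actual_min, osrm_min)
segments = [
    ("HUB_A", "HUB_B", "FTL", 130.0, 100.0),
    ("HUB_A", "HUB_B", "FTL", 150.0, 100.0),
    ("HUB_A", "HUB_B", "FTL", 120.0, 100.0),
    ("HUB_B", "HUB_C", "Carting", 55.0, 50.0),
    ("HUB_B", "HUB_C", "Carting", 60.0, 50.0),
    ("HUB_A", "HUB_C", "FTL", 210.0, 200.0),
]

# Collect the actual-vs-OSRM ratio for every trip on each corridor and route type.
ratios = defaultdict(list)
for src, dst, route_type, actual, osrm in segments:
    ratios[(src, dst, route_type)].append(actual / osrm)

# Edge weight: median delay ratio per (source, destination, route_type).
edges = {key: median(values) for key, values in ratios.items()}

# Chronically delayed corridors: actual time exceeds OSRM by more than 20%.
chronic = {key: ratio for key, ratio in edges.items() if ratio > 1.20}

# In/out-degree per facility, counting each directed corridor once.
out_degree, in_degree = defaultdict(int), defaultdict(int)
for src, dst in {(s, d) for s, d, _ in edges}:
    out_degree[src] += 1
    in_degree[dst] += 1

for key, ratio in sorted(chronic.items()):
    print(key, "median delay ratio:", round(ratio, 2))
```

Using the median rather than the mean keeps a single extreme trip from defining a corridor's weight, which is one reason the task specifies it.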

Expected Impact
By the end of this project, your solution should help Delhivery's operations team:
• Make smarter ETA predictions that reduce the gap between promised and actual delivery time
• Identify and prioritize hub upgrades based on their structural risk and SLA breach contribution
• Reduce SLA breaches on chronic delay corridors through targeted interventions
• Improve route-type decisions with a data-backed FTL vs Carting framework
• Recover revenue at risk by quantifying and acting on the highest-impact bottlenecks

What your final submission should include


1. Well-documented code: data pipeline, graph construction, model training, and benchmarking
2. Graph visualizations of the logistics network with bottleneck hubs and delay corridors highlighted
3. Model performance comparison (baseline vs. graph-enhanced) on both MAE and the 15%-accuracy
business metric
4. FTL vs Carting decision framework with supporting analysis
5. Network Operations Strategy Memo (1–2 pages) with the top 5 bottleneck hubs, corridor-specific
interventions, and revenue impact estimates
6. Optional: A live Streamlit dashboard showing the network with real-time delay risk scores
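
The two comparison metrics in item 3 can be computed as in the minimal sketch below. The ETA values are made-up numbers for illustration, not results. The "within 15%" tolerance is interpreted here relative to the actual delivery time, which is an assumption worth stating in your report.

```python
def mae(actual, predicted):
    """Mean absolute error between actual and predicted delivery times."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def within_tolerance_share(actual, predicted, tolerance=0.15):
    """Share of trips whose predicted ETA is within `tolerance` of the actual time."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(p - a) <= tolerance * a)
    return hits / len(actual)


# Made-up delivery times in minutes, for illustration only.
actual_minutes = [100, 200, 300, 400]
baseline_eta = [80, 210, 390, 400]
graph_eta = [95, 205, 320, 410]

for name, eta in [("baseline", baseline_eta), ("graph-enhanced", graph_eta)]:
    print(name,
          "MAE:", mae(actual_minutes, eta),
          "within 15%:", within_tolerance_share(actual_minutes, eta))
```

Both models should be scored on the same held-out trips so that the measured difference reflects the models and not the split.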

Point of Contact
Saksham Gupta - 7009410295
Yashvi Mehta - 6363679144

Dataset & Resources


Link (Dataset) Link (Resources)
