Skip to content

fix(amber): add timeout and retry to dataset file-service requests#5667

Open
Ma77Ball wants to merge 7 commits into
apache:mainfrom
Ma77Ball:fix/dataset-file-document-request-timeout
Open

fix(amber): add timeout and retry to dataset file-service requests#5667
Ma77Ball wants to merge 7 commits into
apache:mainfrom
Ma77Ball:fix/dataset-file-document-request-timeout

Conversation

@Ma77Ball

@Ma77Ball Ma77Ball commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

  • Route DatasetFileDocument's presigned-URL fetch and file download through a requests.Session configured with a (10s connect, 60s read) timeout, so a hung or unreachable file-service fails in bounded time instead of blocking the worker thread forever.
  • Mount a urllib3 Retry policy on the session (3 retries, exponential backoff, retrying on connection errors and 5xx). Both calls are idempotent GETs, so the retry set is restricted to GET.
  • Translate network failures (connect/read timeouts and connection errors, including those surfaced after retries are exhausted) into RuntimeError, consistent with the module's existing failure handling, so callers get a uniform error contract instead of a raw requests/urllib3 exception.

Any related issues, documentation, discussions?

Closes: #5666

How was this PR tested?

  • Added pytest coverage in test_dataset_file_document.py (26 tests):
    • asserts the (connect, read) timeout tuple is passed on both the presigned-URL request and the file download;
    • asserts the retry adapter is mounted for http:// and https:// with the expected policy (total=3, connect=3, read=3, backoff_factor=0.5, status_forcelist={500,502,503,504}, GET-only);
    • asserts a ReadTimeout / ConnectionError is wrapped in RuntimeError on both code paths.
  • ruff check and ruff format --check pass on the modified files.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

@github-actions

github-actions Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 0 better · 🔴 5 worse · ⚪ 10 noise (<±5%) · 0 without baseline

Compared against main dfa0434 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🔴 bs=10 sw=10 sl=64 480 0.293 20,137/28,757/28,757 us 🔴 -11.8% / 🟢 -17.8%
bs=100 sw=10 sl=64 1,226 0.748 80,682/96,233/96,233 us ⚪ within ±5% / 🟢 +37.5%
bs=1000 sw=10 sl=64 1,431 0.873 701,624/730,806/730,806 us ⚪ within ±5% / 🟢 +37.5%
Baseline details

Latest main dfa0434 from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 480 tuples/sec 544 tuples/sec 410.82 tuples/sec -11.8% +16.8%
bs=10 sw=10 sl=64 MB/s 0.293 MB/s 0.332 MB/s 0.251 MB/s -11.7% +16.9%
bs=10 sw=10 sl=64 p50 20,137 us 18,027 us 23,785 us +11.7% -15.3%
bs=10 sw=10 sl=64 p95 28,757 us 27,273 us 34,980 us +5.4% -17.8%
bs=10 sw=10 sl=64 p99 28,757 us 27,273 us 34,980 us +5.4% -17.8%
bs=100 sw=10 sl=64 throughput 1,226 tuples/sec 1,196 tuples/sec 891.94 tuples/sec +2.5% +37.5%
bs=100 sw=10 sl=64 MB/s 0.748 MB/s 0.73 MB/s 0.544 MB/s +2.5% +37.4%
bs=100 sw=10 sl=64 p50 80,682 us 83,073 us 112,277 us -2.9% -28.1%
bs=100 sw=10 sl=64 p95 96,233 us 100,618 us 139,802 us -4.4% -31.2%
bs=100 sw=10 sl=64 p99 96,233 us 100,618 us 139,802 us -4.4% -31.2%
bs=1000 sw=10 sl=64 throughput 1,431 tuples/sec 1,460 tuples/sec 1,041 tuples/sec -2.0% +37.5%
bs=1000 sw=10 sl=64 MB/s 0.873 MB/s 0.891 MB/s 0.635 MB/s -2.0% +37.4%
bs=1000 sw=10 sl=64 p50 701,624 us 687,812 us 972,714 us +2.0% -27.9%
bs=1000 sw=10 sl=64 p95 730,806 us 711,321 us 1,023,057 us +2.7% -28.6%
bs=1000 sw=10 sl=64 p99 730,806 us 711,321 us 1,023,057 us +2.7% -28.6%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,416.57,200,128000,480,0.293,20137.10,28756.87,28756.87
1,100,10,64,20,1631.13,2000,1280000,1226,0.748,80682.38,96232.83,96232.83
2,1000,10,64,20,13977.21,20000,12800000,1431,0.873,701624.06,730806.35,730806.35

@codecov-commenter

codecov-commenter commented Jun 14, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.95%. Comparing base (dfa0434) to head (681b686).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5667      +/-   ##
============================================
+ Coverage     52.91%   52.95%   +0.03%     
+ Complexity     2626     2481     -145     
============================================
  Files          1090     1075      -15     
  Lines         42188    41678     -510     
  Branches       4531     4503      -28     
============================================
- Hits          22324    22070     -254     
+ Misses        18555    18315     -240     
+ Partials       1309     1293      -16     
Flag Coverage Δ *Carryforward flag
access-control-service 71.42% <ø> (+0.51%) ⬆️ Carriedforward from ed0d643
agent-service 34.36% <ø> (ø) Carriedforward from ed0d643
amber 53.64% <ø> (+0.62%) ⬆️ Carriedforward from ed0d643
computing-unit-managing-service 1.65% <ø> (ø) Carriedforward from ed0d643
config-service 56.71% <ø> (ø) Carriedforward from ed0d643
file-service 57.06% <ø> (ø) Carriedforward from ed0d643
frontend 47.38% <ø> (-0.56%) ⬇️ Carriedforward from ed0d643
pyamber 89.83% <100.00%> (-0.91%) ⬇️
python 90.73% <ø> (-0.01%) ⬇️ Carriedforward from ed0d643
workflow-compiling-service 58.69% <ø> (ø) Carriedforward from ed0d643

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Ma77Ball added 3 commits June 14, 2026 08:54
  file-service requests

  DatasetFileDocument made two requests.get() calls
  (presigned-URL fetch
  and file download) with no timeout, so a hung or
  unreachable
  file-service would block the worker thread
  indefinitely.

  Route both calls through a Session with a (10s
  connect, 60s read)
  timeout and a urllib3 Retry policy (3 retries,
  exponential backoff,
  retrying on connection errors and 5xx). Both calls
  are idempotent GETs,
  so retrying is safe.
@Ma77Ball Ma77Ball force-pushed the fix/dataset-file-document-request-timeout branch from 977e455 to 206ac02 Compare June 14, 2026 15:55
@Ma77Ball Ma77Ball marked this pull request as ready for review June 14, 2026 16:06
@Ma77Ball

Copy link
Copy Markdown
Contributor Author

@Yicong-Huang would you be able to review this when you get a chance? It hardens DatasetFileDocument's file-service requests with bounded timeouts, GET-only retries, and uniform RuntimeError failures (you authored this module originally, so you have the most context here). Thanks!

@Yicong-Huang Yicong-Huang requested a review from bobbai00 June 14, 2026 16:21
@Yicong-Huang

Copy link
Copy Markdown
Contributor

Sure. Adding @bobbai00 as well PTAL

Comment on lines +24 to +49

# (connect, read) timeout and retry settings for the file-service GETs below.
_CONNECT_TIMEOUT_SECONDS = 10
_READ_TIMEOUT_SECONDS = 60
_REQUEST_TIMEOUT = (_CONNECT_TIMEOUT_SECONDS, _READ_TIMEOUT_SECONDS)
_MAX_RETRIES = 3
_RETRY_BACKOFF_FACTOR = 0.5
_RETRY_STATUS_FORCELIST = (500, 502, 503, 504)


def _build_session() -> requests.Session:
"""Returns a Session that retries GETs on connection errors and 5xx."""
retry = Retry(
total=_MAX_RETRIES,
connect=_MAX_RETRIES,
read=_MAX_RETRIES,
backoff_factor=_RETRY_BACKOFF_FACTOR,
status_forcelist=_RETRY_STATUS_FORCELIST,
allowed_methods=frozenset({"GET"}),
raise_on_status=False,
)
adapter = HTTPAdapter(max_retries=retry)
session = requests.Session()
session.mount("http://", adapter)
session.mount("https://", adapter)
return session

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move those into singleton class? I want to avoid global module level instances and methods.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: moved the timeout/retry constants and _build_session into DatasetFileDocument (class attributes + @classmethod), so no module-level state remains.

Comment on lines +46 to +49
session = requests.Session()
session.mount("http://", adapter)
session.mount("https://", adapter)
return session

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need a session just to do retry?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes: requests.get() has no retry option; urllib3's Retry only applies when mounted as an HTTPAdapter on a Session. The alternative is a manual retry loop. Open to other approaches.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add timeout and retry to DatasetFileDocument file-service requests

3 participants