fix(amber): add timeout and retry to dataset file-service requests#5667
fix(amber): add timeout and retry to dataset file-service requests#5667Ma77Ball wants to merge 7 commits into
Conversation
|
| config | throughput | MB/s | latency | max Δ latest / 7d | |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 480 | 0.293 | 20,137/28,757/28,757 us | 🔴 -11.8% / 🟢 -17.8% |
| ⚪ | bs=100 sw=10 sl=64 | 1,226 | 0.748 | 80,682/96,233/96,233 us | ⚪ within ±5% / 🟢 +37.5% |
| ⚪ | bs=1000 sw=10 sl=64 | 1,431 | 0.873 | 701,624/730,806/730,806 us | ⚪ within ±5% / 🟢 +37.5% |
Baseline details
Latest main dfa0434 from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 480 tuples/sec | 544 tuples/sec | 410.82 tuples/sec | -11.8% | +16.8% |
| bs=10 sw=10 sl=64 | MB/s | 0.293 MB/s | 0.332 MB/s | 0.251 MB/s | -11.7% | +16.9% |
| bs=10 sw=10 sl=64 | p50 | 20,137 us | 18,027 us | 23,785 us | +11.7% | -15.3% |
| bs=10 sw=10 sl=64 | p95 | 28,757 us | 27,273 us | 34,980 us | +5.4% | -17.8% |
| bs=10 sw=10 sl=64 | p99 | 28,757 us | 27,273 us | 34,980 us | +5.4% | -17.8% |
| bs=100 sw=10 sl=64 | throughput | 1,226 tuples/sec | 1,196 tuples/sec | 891.94 tuples/sec | +2.5% | +37.5% |
| bs=100 sw=10 sl=64 | MB/s | 0.748 MB/s | 0.73 MB/s | 0.544 MB/s | +2.5% | +37.4% |
| bs=100 sw=10 sl=64 | p50 | 80,682 us | 83,073 us | 112,277 us | -2.9% | -28.1% |
| bs=100 sw=10 sl=64 | p95 | 96,233 us | 100,618 us | 139,802 us | -4.4% | -31.2% |
| bs=100 sw=10 sl=64 | p99 | 96,233 us | 100,618 us | 139,802 us | -4.4% | -31.2% |
| bs=1000 sw=10 sl=64 | throughput | 1,431 tuples/sec | 1,460 tuples/sec | 1,041 tuples/sec | -2.0% | +37.5% |
| bs=1000 sw=10 sl=64 | MB/s | 0.873 MB/s | 0.891 MB/s | 0.635 MB/s | -2.0% | +37.4% |
| bs=1000 sw=10 sl=64 | p50 | 701,624 us | 687,812 us | 972,714 us | +2.0% | -27.9% |
| bs=1000 sw=10 sl=64 | p95 | 730,806 us | 711,321 us | 1,023,057 us | +2.7% | -28.6% |
| bs=1000 sw=10 sl=64 | p99 | 730,806 us | 711,321 us | 1,023,057 us | +2.7% | -28.6% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,416.57,200,128000,480,0.293,20137.10,28756.87,28756.87
1,100,10,64,20,1631.13,2000,1280000,1226,0.748,80682.38,96232.83,96232.83
2,1000,10,64,20,13977.21,20000,12800000,1431,0.873,701624.06,730806.35,730806.35
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #5667 +/- ##
============================================
+ Coverage 52.91% 52.95% +0.03%
+ Complexity 2626 2481 -145
============================================
Files 1090 1075 -15
Lines 42188 41678 -510
Branches 4531 4503 -28
============================================
- Hits 22324 22070 -254
+ Misses 18555 18315 -240
+ Partials 1309 1293 -16
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
file-service requests DatasetFileDocument made two requests.get() calls (presigned-URL fetch and file download) with no timeout, so a hung or unreachable file-service would block the worker thread indefinitely. Route both calls through a Session with a (10s connect, 60s read) timeout and a urllib3 Retry policy (3 retries, exponential backoff, retrying on connection errors and 5xx). Both calls are idempotent GETs, so retrying is safe.
977e455 to
206ac02
Compare
…-document-request-timeout
|
@Yicong-Huang would you be able to review this when you get a chance? It hardens |
|
Sure. Adding @bobbai00 as well PTAL |
|
|
||
| # (connect, read) timeout and retry settings for the file-service GETs below. | ||
| _CONNECT_TIMEOUT_SECONDS = 10 | ||
| _READ_TIMEOUT_SECONDS = 60 | ||
| _REQUEST_TIMEOUT = (_CONNECT_TIMEOUT_SECONDS, _READ_TIMEOUT_SECONDS) | ||
| _MAX_RETRIES = 3 | ||
| _RETRY_BACKOFF_FACTOR = 0.5 | ||
| _RETRY_STATUS_FORCELIST = (500, 502, 503, 504) | ||
|
|
||
|
|
||
| def _build_session() -> requests.Session: | ||
| """Returns a Session that retries GETs on connection errors and 5xx.""" | ||
| retry = Retry( | ||
| total=_MAX_RETRIES, | ||
| connect=_MAX_RETRIES, | ||
| read=_MAX_RETRIES, | ||
| backoff_factor=_RETRY_BACKOFF_FACTOR, | ||
| status_forcelist=_RETRY_STATUS_FORCELIST, | ||
| allowed_methods=frozenset({"GET"}), | ||
| raise_on_status=False, | ||
| ) | ||
| adapter = HTTPAdapter(max_retries=retry) | ||
| session = requests.Session() | ||
| session.mount("http://", adapter) | ||
| session.mount("https://", adapter) | ||
| return session |
There was a problem hiding this comment.
move those into singleton class? I want to avoid global module level instances and methods.
There was a problem hiding this comment.
Done: moved the timeout/retry constants and _build_session into DatasetFileDocument (class attributes + @classmethod), so no module-level state remains.
| session = requests.Session() | ||
| session.mount("http://", adapter) | ||
| session.mount("https://", adapter) | ||
| return session |
There was a problem hiding this comment.
do we need a session just to do retry?
There was a problem hiding this comment.
Yes: requests.get() has no retry option; urllib3's Retry only applies when mounted as an HTTPAdapter on a Session. The alternative is a manual retry loop. Open to other approaches.
What changes were proposed in this PR?
DatasetFileDocument's presigned-URL fetch and file download through arequests.Sessionconfigured with a(10s connect, 60s read)timeout, so a hung or unreachable file-service fails in bounded time instead of blocking the worker thread forever.urllib3Retrypolicy on the session (3 retries, exponential backoff, retrying on connection errors and 5xx). Both calls are idempotent GETs, so the retry set is restricted toGET.RuntimeError, consistent with the module's existing failure handling, so callers get a uniform error contract instead of a rawrequests/urllib3exception.Any related issues, documentation, discussions?
Closes: #5666
How was this PR tested?
pytestcoverage intest_dataset_file_document.py(26 tests):(connect, read)timeout tuple is passed on both the presigned-URL request and the file download;http://andhttps://with the expected policy (total=3,connect=3,read=3,backoff_factor=0.5,status_forcelist={500,502,503,504}, GET-only);ReadTimeout/ConnectionErroris wrapped inRuntimeErroron both code paths.ruff checkandruff format --checkpass on the modified files.Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF