[Feat]: Add RayStorageClient to support the RDT feature of Ray by Evelynn-V · Pull Request #102 · TransferQueue/TransferQueue

Evelynn-V · 2025-11-06T07:33:19Z

Summary

Implemented the class RayStorageClient and tested it on CPU using the Ray object store (using NIXL on GPUs requires further testing).

Change

Added transfer_queue/storage/managers/ray_kv_manager.py: adds class RayKVStorageManager, inherited from KVStorageManager, which verifies the device_id configuration and initializes Ray.
Added transfer_queue/storage/clients/ray_storage_client.py: adds classes RayStorageClient and RayGpuObjectRefStorage, which encapsulate calls to the NIXL transport interface used in Ray.
Added simple unit tests: tests/test_ray_storage_client.py.

Testing

Test RayStorageClient on CPU (Requires installing ray >= 2.50):

pytest tests/test_ray_storage_client.py::test_ray_storage_put_get

Test RayStorageClient on GPU:

pytest tests/test_ray_storage_client.py::test_nixl_vs_object_store_performance -s

TODO

Test the transfer of data using NIXL on GPUs with IB and without gdrcopy
Test the transfer of data using NIXL on GPUs with gdrcopy
Test the transfer of data using NIXL on GPUs with cuda_ipc

Summary by CodeRabbit

New Features
- Added Ray-based GPU-aware storage client for efficient tensor handling and retrieval
- Enabled concurrent tensor storage and management across multiple clients
Tests
- Introduced comprehensive test suite covering storage operations and concurrent access scenarios

coderabbitai · 2025-11-06T07:33:38Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: adding RayStorageClient to support the RDT feature of Ray, which aligns with the core changes across all modified files.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.


dpj135 · 2025-11-07T01:54:33Z

+class RayStorageClient(TransferQueueStorageKVClient):
+
+    def __init__(self, config: dict[str, Any]):  
+        if not ray.is_initialized():  


ray.init() is necessary. However, this check should probably not be performed by the storage client; it might be the responsibility of the user or the upper-level interface.

A tricky thing here: the Ray driver does not support ray.put(v, _tensor_transport="nixl"). So if Ray is initialized here, this process becomes a driver and the following put() will fail. I would prefer to raise an error here.
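
A minimal sketch of the fail-fast behaviour suggested above, assuming the client should refuse to start Ray on its own. The helper name `require_ray_initialized` is hypothetical; the check is passed in as a callable so the snippet runs without Ray installed, and in the client it would be `ray.is_initialized`.

```python
from typing import Callable


def require_ray_initialized(is_initialized: Callable[[], bool]) -> None:
    """Raise instead of calling ray.init() from inside the storage client.

    Hypothetical helper: in the real client, pass `ray.is_initialized`.
    """
    if not is_initialized():
        raise RuntimeError(
            "Ray is not initialized. Call ray.init() in the application entry "
            "point before constructing the storage client."
        )


# Stand-ins for ray.is_initialized(), used only to exercise the helper.
require_ray_initialized(lambda: True)  # passes silently
try:
    require_ray_initialized(lambda: False)
except RuntimeError as exc:
    print(f"refused: {exc}")
```

Raising keeps the decision about when and how Ray starts with the caller, which is the division of responsibility proposed in this thread.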

dpj135 · 2025-11-07T02:01:44Z

+        values: list[Tensor] = self._create_empty_tensorlist(shapes=shapes, dtypes=dtypes)
+        storage = ray.get_actor("RayGpuObjectRefStorage")
+
+        gpu_obj_refs = ray.get(storage.get_gpu_obj_ref.remote(keys))
+        # values = ray.get(gpu_obj_refs)
+        values = ray.get(gpu_obj_refs, _tensor_transport="nixl")


values is assigned twice between line 112 and line 115. _create_empty_tensorlist() is redundant; we can remove it.

dpj135 · 2025-11-07T02:02:38Z

+        if not ray.is_initialized():
+            ray.init()


Same as above.

dpj135 · 2025-11-07T02:10:48Z

+        if self.use_gpu:  
+            gpu_ids = ray.get_gpu_ids()  
+            if gpu_ids:  
+                self.device_id = gpu_ids[0]  
+            else:  
+                self.device_id = config.get("device_id", 0)  
+            torch.cuda.set_device(self.device_id)  


Users may call torch.cuda.set_device() in __main__. It might be better to first check whether the GPU device has already been set in the environment. If not, then proceed to determine self.device_id (and self.device_id should probably prioritize the value specified in the config).
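
One way to read the precedence proposed above, as a sketch only. `resolve_device_id` and its `preset_device` parameter are hypothetical, and how the client would detect a device that the user already selected is left open here; `ray_gpu_ids` stands for the result of `ray.get_gpu_ids()`.

```python
from typing import Any, Optional, Sequence


def resolve_device_id(
    config: dict[str, Any],
    ray_gpu_ids: Sequence[int],
    preset_device: Optional[int] = None,
) -> int:
    """Pick a device id: user-set device first, then config, then Ray's assignment."""
    if preset_device is not None:
        # The user already chose a device (for example in __main__); keep it.
        return preset_device
    if config.get("device_id") is not None:
        # An explicit config value takes priority over Ray's assignment.
        return int(config["device_id"])
    if ray_gpu_ids:
        return int(ray_gpu_ids[0])
    return 0  # Same default as the original config.get("device_id", 0).


print(resolve_device_id({"device_id": 2}, [1]))
print(resolve_device_id({}, [1]))
```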

0oshowero0 · 2025-11-07T03:11:15Z

@coderabbitai review

coderabbitai · 2025-11-07T03:11:22Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copilot

Pull Request Overview

This PR adds Ray-based distributed storage support to the transfer queue system, enabling tensor storage and retrieval using Ray's distributed object store with optional NIXL transport for GPU tensors.

- Implements RayStorageClient for storing and retrieving tensors via Ray's object store
- Adds RayKVStorageManager to manage Ray-based key-value storage operations
- Includes comprehensive tests for basic operations and multi-client concurrent scenarios

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

File	Description
transfer_queue/storage/clients/ray_storage_client.py	Implements Ray storage client with GPU/CPU support and NIXL transport for distributed tensor operations
transfer_queue/storage/managers/ray_kv_manager.py	Adds Ray-based KV storage manager with device validation
transfer_queue/storage/clients/__init__.py	Exports the new RayStorageClient class
tests/test_ray_storage_client.py	Provides unit tests for Ray storage client operations


Copilot · 2025-11-07T03:14:09Z

+        Store tensors to remote storage.
+        Args:
+            keys (list): List of string keys
+            values (list): List of torch.Tensor on NPU


The docstring incorrectly states 'List of torch.Tensor on NPU' but this client supports both GPU (CUDA) and CPU tensors, not NPU. Update to reflect actual supported devices.

Suggested change

-            values (list): List of torch.Tensor on NPU
+            values (list): List of torch.Tensor on GPU (CUDA) or CPU

Copilot · 2025-11-07T03:14:09Z

+            shapes (list): Expected shapes of returned tensors
+            dtypes (list): Expected dtypes of returned tensors
+        Returns:
+            list: List of retrieved NPU tensors


The docstring incorrectly states 'List of retrieved NPU tensors' but this returns either CUDA or CPU tensors, not NPU. Update to accurately describe the return type.

Copilot · 2025-11-07T03:14:09Z

+        Create a list of empty GPU tensors with given shapes and dtypes.
+        Args:
+            shapes (list): List of tensor shapes (e.g., [(3,), (2, 4)])
+            dtypes (list): List of torch dtypes (e.g., [torch.float32, torch.int64])
+        Returns:
+            list: List of uninitialized GPU tensors


The docstring says 'empty GPU tensors' but when use_gpu is False, this method creates CPU tensors. Update to 'Create a list of empty tensors with given shapes and dtypes' for accuracy.

Suggested change

-        Create a list of empty GPU tensors with given shapes and dtypes.
-        Args:
-            shapes (list): List of tensor shapes (e.g., [(3,), (2, 4)])
-            dtypes (list): List of torch dtypes (e.g., [torch.float32, torch.int64])
-        Returns:
-            list: List of uninitialized GPU tensors
+        Create a list of empty tensors with given shapes and dtypes.
+        The tensors are created on the device specified by self.device (CPU or GPU).
+        Args:
+            shapes (list): List of tensor shapes (e.g., [(3,), (2, 4)])
+            dtypes (list): List of torch dtypes (e.g., [torch.float32, torch.int64])
+        Returns:
+            list: List of uninitialized tensors

Copilot · 2025-11-07T03:14:09Z

+            shapes (list): List of tensor shapes (e.g., [(3,), (2, 4)])
+            dtypes (list): List of torch dtypes (e.g., [torch.float32, torch.int64])
+        Returns:
+            list: List of uninitialized GPU tensors


The return description states 'GPU tensors' but the method can return CPU tensors when use_gpu is False. Update to 'list: List of uninitialized tensors'.

Suggested change

-            list: List of uninitialized GPU tensors
+            list: List of uninitialized tensors

Copilot · 2025-11-07T03:14:10Z

+        if len(dtypes) != len(shapes):
+            raise ValueError("Length of dtypes must equal length of shapes")
+
+        values: list[Tensor] = self._create_empty_tensorlist(shapes=shapes, dtypes=dtypes)


The variable values is assigned from _create_empty_tensorlist but then immediately overwritten on line 116 without being used. Remove this unused assignment.

Suggested change

-        values: list[Tensor] = self._create_empty_tensorlist(shapes=shapes, dtypes=dtypes)

Copilot · 2025-11-07T03:14:10Z

+
+        gpu_obj_refs = ray.get(storage.get_gpu_obj_ref.remote(keys))
+        # values = ray.get(gpu_obj_refs)
+        values = ray.get(gpu_obj_refs, _tensor_transport="nixl")


The code unconditionally uses _tensor_transport='nixl' even when self.use_gpu is False. This should be conditional like in the put method to avoid errors when running on CPU-only systems.

Suggested change

-        values = ray.get(gpu_obj_refs, _tensor_transport="nixl")
+        if self.use_gpu:
+            values = ray.get(gpu_obj_refs, _tensor_transport="nixl")
+        else:
+            values = ray.get(gpu_obj_refs)
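
The same condition applies to both put() and get(), so it could live in one place. A small sketch, assuming a hypothetical helper `tensor_transport_kwargs` that builds the keyword arguments; only the `_tensor_transport="nixl"` argument comes from the PR.

```python
from typing import Any


def tensor_transport_kwargs(use_gpu: bool) -> dict[str, Any]:
    """Keyword arguments for ray.put/ray.get: NIXL on GPU, default object store otherwise."""
    return {"_tensor_transport": "nixl"} if use_gpu else {}


# Intended call sites in the client (not executed here, they need Ray):
#   obj_refs = [ray.put(v, **tensor_transport_kwargs(self.use_gpu)) for v in values]
#   values = ray.get(gpu_obj_refs, **tensor_transport_kwargs(self.use_gpu))
print(tensor_transport_kwargs(True))
print(tensor_transport_kwargs(False))
```

Sharing the helper keeps the transport used by get() aligned with the one used by put().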

Copilot · 2025-11-07T03:14:10Z

+from transfer_queue.storage.managers.base import KVStorageManager
+
+
+class RayKVStorageManager(KVStorageManager):


The RayKVStorageManager class is not registered with the TransferQueueStorageManagerFactory and is not exported in the __init__.py. This makes it inaccessible through the factory pattern used elsewhere in the codebase.

Copilot · 2025-11-07T03:14:11Z

+
+    for i, client in enumerate(clients):    
+        keys = [f"client_{i}_tensor_{j}" for j in range(3)]    
+        values = [torch.randn(10, 10) * i for _ in range(3)] 


The variable i is used in the list comprehension but the loop variable is _. This means all three tensors have the same multiplier. Change _ to j or use i consistently.

Suggested change

-        values = [torch.randn(10, 10) * i for _ in range(3)]
+        values = [torch.randn(10, 10) * j for j in range(3)]

Copilot · 2025-11-07T03:14:11Z

+import torch  
+import sys
+from pathlib import Path
+from tensordict import TensorDict  


Import of 'TensorDict' is not used.

Suggested change

-from tensordict import TensorDict

coderabbitai

Actionable comments posted: 2

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 8433e17 and ec32d98.

📒 Files selected for processing (4)

tests/test_ray_storage_client.py (1 hunks)
transfer_queue/storage/clients/__init__.py (1 hunks)
transfer_queue/storage/clients/ray_storage_client.py (1 hunks)
transfer_queue/storage/managers/ray_kv_manager.py (1 hunks)

coderabbitai · 2025-11-07T03:16:22Z

+        if len(dtypes) != len(shapes):
+            raise ValueError("Length of dtypes must equal length of shapes")
+
+        values: list[Tensor] = self._create_empty_tensorlist(shapes=shapes, dtypes=dtypes)
+        storage = ray.get_actor("RayGpuObjectRefStorage")
+
+        gpu_obj_refs = ray.get(storage.get_gpu_obj_ref.remote(keys))
+        # values = ray.get(gpu_obj_refs)
+        values = ray.get(gpu_obj_refs, _tensor_transport="nixl")
+        return values


⚠️ Potential issue | 🔴 Critical

Handle optional shapes/dtypes before calling len()

get() declares shapes/dtypes as optional but immediately calls len() on them, so any caller using the defaults hits a TypeError. Enforce that both are provided (or supply a fallback) before taking their length.

Please apply:

-        if len(dtypes) != len(shapes):
-            raise ValueError("Length of dtypes must equal length of shapes")
-
-        values: list[Tensor] = self._create_empty_tensorlist(shapes=shapes, dtypes=dtypes)
+        if shapes is None or dtypes is None:
+            raise ValueError("Both shapes and dtypes are required when retrieving tensors")
+        if len(dtypes) != len(shapes):
+            raise ValueError("Length of dtypes must equal length of shapes")

🤖 Prompt for AI Agents

In transfer_queue/storage/clients/ray_storage_client.py around lines 108-117, the method treats shapes and dtypes as optional but calls len() on them immediately, causing a TypeError when they are None; add a guard before the length check to ensure both shapes and dtypes are provided (or explicitly set sensible defaults) — e.g., if either is None raise a clear ValueError like "shapes and dtypes must be provided" (or derive defaults) before using len(), then proceed with creating tensors and fetching GPU refs as before.
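
The guard described above, pulled out as a standalone function so it can be exercised without Ray or torch. The name `check_shapes_and_dtypes` is hypothetical; the error messages follow the suggested patch.

```python
from typing import Optional, Sequence


def check_shapes_and_dtypes(shapes: Optional[Sequence], dtypes: Optional[Sequence]) -> None:
    """Validate the optional arguments before any len() call is made on them."""
    if shapes is None or dtypes is None:
        raise ValueError("Both shapes and dtypes are required when retrieving tensors")
    if len(dtypes) != len(shapes):
        raise ValueError("Length of dtypes must equal length of shapes")


check_shapes_and_dtypes([(3,), (2, 4)], ["float32", "int64"])  # passes silently
for bad_args in [(None, None), ([(3,)], [])]:
    try:
        check_shapes_and_dtypes(*bad_args)
    except ValueError as exc:
        print(f"rejected: {exc}")
```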

coderabbitai · 2025-11-07T03:16:22Z

+        gpu_obj_refs = ray.get(storage.get_gpu_obj_ref.remote(keys))
+        # values = ray.get(gpu_obj_refs)
+        values = ray.get(gpu_obj_refs, _tensor_transport="nixl")
+        return values


⚠️ Potential issue | 🔴 Critical

Match tensor transport with how the refs were stored

In the CPU path put() uses the default object store, but get() forces _tensor_transport="nixl". If NIXL isn’t installed or the refs came from the object store (the CPU case), ray.get raises. Pick the transport conditionally so CPU callers use the object store while GPU callers continue with NIXL. (docs.ray.io)

Suggested fix:

-        gpu_obj_refs = ray.get(storage.get_gpu_obj_ref.remote(keys))
-        # values = ray.get(gpu_obj_refs)
-        values = ray.get(gpu_obj_refs, _tensor_transport="nixl")
+        gpu_obj_refs = ray.get(storage.get_gpu_obj_ref.remote(keys))
+        if self.use_gpu:
+            values = ray.get(gpu_obj_refs, _tensor_transport="nixl")
+        else:
+            values = ray.get(gpu_obj_refs)

🤖 Prompt for AI Agents

In transfer_queue/storage/clients/ray_storage_client.py around lines 114-117, the code unconditionally calls ray.get(..., _tensor_transport="nixl") which will fail for refs stored in the default CPU object store or when NIXL isn't available; change this to try the NIXL transport first and if ray.get raises (TypeError/RuntimeError or any RayError indicating unsupported transport), fall back to calling ray.get without the _tensor_transport argument so CPU-backed refs succeed — implement a try/except around the ray.get call and return the successful result from either the NIXL attempt or the default fallback.
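
The try-then-fall-back variant described in this prompt could look roughly like the sketch below. `get_with_fallback` is a hypothetical name, `fetch` stands in for `ray.get`, and the caught exception types are the ones named in the prompt; which exceptions Ray actually raises for an unsupported transport is an assumption that would need checking.

```python
from typing import Any, Callable, Sequence


def get_with_fallback(fetch: Callable[..., Any], refs: Sequence[Any]) -> Any:
    """Try the NIXL transport first, then retry through the default object store."""
    try:
        return fetch(refs, _tensor_transport="nixl")
    except (TypeError, RuntimeError):
        # Assumed failure modes for refs that were stored without NIXL.
        return fetch(refs)


def fake_cpu_fetch(refs, _tensor_transport=None):
    """Stand-in for ray.get on a CPU-only setup: rejects the NIXL transport."""
    if _tensor_transport == "nixl":
        raise RuntimeError("NIXL transport unavailable")
    return list(refs)


print(get_with_fallback(fake_cpu_fetch, ["ref_a", "ref_b"]))
```

Compared with branching on self.use_gpu, this costs one failed call per CPU request and can hide unrelated errors behind the retry, so the conditional form in the suggested fix may be the simpler choice.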

0oshowero0 · 2025-11-07T03:19:04Z

+from transfer_queue.storage.clients.base import TransferQueueStorageKVClient
+from transfer_queue.storage.clients.factory import StorageClientFactory
+
+@ray.remote


Maybe we should set @ray.remote(max_concurrency=XX) to enable concurrent remote calls for better performance.

0oshowero0 · 2025-11-07T03:21:14Z

+from transfer_queue.storage.clients.factory import StorageClientFactory
+
+@ray.remote
+class RayGpuObjectRefStorage:


I think this class can also be used when RDT is not available. So maybe it should be called RayObjectRefStorage?

0oshowero0 · 2025-11-07T03:26:03Z

+        Store tensors to remote storage.
+        Args:
+            keys (list): List of string keys
+            values (list): List of torch.Tensor on NPU


users may also use ray to transport torch.Tensor on CPU

0oshowero0 · 2025-11-07T03:29:39Z

+            values (list): List of torch.Tensor on NPU
+        """
+        if not isinstance(keys, list) or not isinstance(values, list):
+            raise ValueError("keys and values must be lists")


Suggested change

-            raise ValueError("keys and values must be lists")
+            raise ValueError(f"keys and values must be lists, but got {type(keys)} and {type(values)}")

0oshowero0 · 2025-11-07T03:40:16Z

+            if not isinstance(value, torch.Tensor):
+                raise ValueError(f"Expected torch.Tensor, got {type(value)}")
+
+        # TODO: NIXL can only be initialized in an environment with GPU, even if data is transferred on the cpu.


If already fixed, simply delete the TODO~

0oshowero0 · 2025-11-07T03:51:12Z

+            obj_refs = [ray.put(v) for v in values]  
+        # obj_refs = [ray.put(v, _tensor_transport="nixl") for v in values]
+
+        storage = RayGpuObjectRefStorage.options(


Maybe we can put this during __init__ and use self.storage to prevent frequent interaction with raylet
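
A sketch of resolving the actor handle once and reusing it, under the assumption that the handle stays valid for the lifetime of the client. `CachedStorageClient` and its `get_actor` parameter are hypothetical; in the real client the lookup would go through Ray (for example `ray.get_actor`), and the actor name is the one used in this PR.

```python
from typing import Any, Callable


class CachedStorageClient:
    """Hypothetical client that looks up the storage actor once, in __init__."""

    def __init__(
        self,
        get_actor: Callable[[str], Any],
        actor_name: str = "RayGpuObjectRefStorage",
    ):
        # Single lookup; put(), get(), and clear() would all reuse self.storage.
        self.storage = get_actor(actor_name)


lookups = []


def fake_get_actor(name: str) -> str:
    """Stand-in for the Ray lookup that records how often it is called."""
    lookups.append(name)
    return f"handle:{name}"


client = CachedStorageClient(fake_get_actor)
for _ in range(3):
    _ = client.storage  # Repeated use does not trigger another lookup.
print(len(lookups))
```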

0oshowero0 · 2025-11-07T03:51:26Z

+            raise ValueError("Length of dtypes must equal length of shapes")
+
+        values: list[Tensor] = self._create_empty_tensorlist(shapes=shapes, dtypes=dtypes)
+        storage = ray.get_actor("RayGpuObjectRefStorage")


same as above comment

0oshowero0 · 2025-11-07T03:55:00Z

+        Returns:
+            list: List of retrieved NPU tensors
+        """
+        if len(dtypes) != len(shapes):


shapes and dtypes can be None, which will cause a TypeError when calling len(dtypes)

0oshowero0 · 2025-11-07T03:56:14Z

+        if len(dtypes) != len(shapes):
+            raise ValueError("Length of dtypes must equal length of shapes")
+
+        values: list[Tensor] = self._create_empty_tensorlist(shapes=shapes, dtypes=dtypes)


We need fallback logic for the case where shapes and dtypes are not available, for example when we use Ray to store non-tensor objects such as np.array, str, etc.

Check automatically whether the value is a tensor that can go through RDT; when RDT is not available, use the ordinary Ray object store.
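
The routing rule above can be sketched as a pure function. `pick_transport` and its parameters are hypothetical; in the client the predicate would be something like `lambda v: isinstance(v, torch.Tensor)`, which is replaced by a stand-in class here so the snippet runs without torch.

```python
from typing import Any, Callable


def pick_transport(value: Any, rdt_available: bool, is_tensor: Callable[[Any], bool]) -> str:
    """Return "nixl" for tensors when RDT is usable, otherwise "object_store"."""
    if rdt_available and is_tensor(value):
        return "nixl"
    return "object_store"


class FakeTensor:
    """Stand-in for torch.Tensor, used only for this illustration."""


def looks_like_tensor(value: Any) -> bool:
    return isinstance(value, FakeTensor)


for item in [FakeTensor(), "a string", [1, 2, 3]]:
    print(type(item).__name__, pick_transport(item, True, looks_like_tensor))
```

Deciding per value means a batch that mixes tensors and plain Python objects does not need shapes or dtypes for the non-tensor entries.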

0oshowero0 · 2025-11-07T03:57:27Z

+        Args:
+            keys (list): List of keys to delete
+        """
+        storage = ray.get_actor("RayGpuObjectRefStorage")


same as above comment

0oshowero0 · 2025-11-07T03:58:00Z

+
+class RayKVStorageManager(KVStorageManager):
+    def __init__(self, config: dict[str, Any]):
+        device_id = config.get("device_id", None)


what is this for?

It has been deleted because we don't care about device_id.

ji-huazhong · 2025-11-10T03:40:03Z

+
+        self.use_gpu = torch.cuda.is_available()  
+
+        if self.use_gpu:  


Does the NPU support RDT? In other words, do we need to add a monkey patch to the transfer queue to support this feature?

Probably not feasible. RDT is based on NIXL for transmission, while NIXL only supports GPU and CPU.

Introduce RDT into TQ

a4b75f5

Introduce RDT into TQ

ec32d98

dpj135 reviewed Nov 7, 2025

0oshowero0 requested a review from Copilot November 7, 2025 03:11

Copilot AI reviewed Nov 7, 2025

coderabbitai Bot reviewed Nov 7, 2025

0oshowero0 reviewed Nov 7, 2025

tianyi-ge reviewed Nov 7, 2025

ji-huazhong reviewed Nov 10, 2025

Evelynn-V added 2 commits November 11, 2025 19:43

Merge remote-tracking branch 'upstream/dev' into RDT

7b458bf

add demos and fix

9533b65

Evelynn-V mentioned this pull request Nov 17, 2025

[Feat]: Add RayStorageClient to support the RDT feature of Ray #108

Open

3 tasks

Evelynn-V closed this Jan 4, 2026

Evelynn-V mentioned this pull request Jan 4, 2026

[Feat]: Add RayStorageClient to support the RDT feature of Ray #167

Merged

3 tasks

Evelynn-V mentioned this pull request Jan 15, 2026

[Feat] Introduce Zero-Copy to use YuanrongStorageClient for transmitting CPU Tensors #171

Open
