[Feat]: Try zero-copy serialize objects that can be converted to memoryview by tianyi-ge · Pull Request #147 · TransferQueue/TransferQueue

tianyi-ge · 2025-12-19T02:10:43Z

Background

For yuanrong-datasystem kv_client, it provides mset that supports cpu contiguous objects e.g. memoryview, bytes, and bytearray. The current solution is pickle.dumps everything on cpu, which in fact unnecessarily allocates and calculates serialization for contiguous objects e.g. tensor, view tensor, numpy array, etc.

Solution

This PR try to recognize common data types that can be passed to mset without copy. One limitation here is that user tensor/view tensor cannot be modified during transfer.

Notice

The object passed to try_zero_copy_serialize has to be read-only and kept alive as long as the key not deleted, or in other words potential transfers may happen.
Here I did not utilize serialization in serial_utils because it's suitable for a mixed tensor dict. For yuanrong kvclient.mset, one key-value pair cannot handle a flattened tensor list. A value can either be memoryview-ed or pickle.dumps-ed as a whole.

coderabbitai · 2025-12-19T02:10:49Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

🎁 Summarized by CodeRabbit Free

Your organization is on the Free plan. CodeRabbit will generate a high-level summary and a walkthrough for each pull request. For a comprehensive line-by-line review, please upgrade your subscription to CodeRabbit Pro by visiting https://app.coderabbit.ai/login.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

0oshowero0 · 2025-12-19T02:47:03Z

There is a pubic utils for zero-copy serialization in #140 , maybe it can be directly used here so that all the backends can share the same set of serial/deserial utils to improve maintainability.

tianyi-ge · 2025-12-19T03:33:47Z

There is a pubic utils for zero-copy serialization in #140 , maybe it can be directly used here so that all the backends can share the same set of serial/deserial utils to improve maintainability.

Yep I noticed this utility. As I mentioned in the PR description, yuanrong kv client expects a memoryview as an argument. Is it possible to extract the memoryview from the return value of serialization?

TransferQueue/transfer_queue/utils/serial_utils.py

Line 214 in 1c0a1f1

return [pickled_bytes, pickle.dumps(nested_tensor_info), *serialized_tensors]

0oshowero0 · 2025-12-19T04:00:52Z

I see. There is a little bit tricky since both MsgpackEncoder and the general serialization cannot directly return only the memory view obj. If yuanrong do not need to tackle tensordict as a whole or nested tensor (which should be already unbinded to tensors), I think it's ok to use memoryview(obj.numpy()) directly.

0oshowero0 · 2025-12-19T04:15:23Z

Need to make sure all these scenarios are considered:

  nested_tensor = torch.nested.as_nested_tensor(
      [torch.randn(4, 3), torch.randn(2, 4)], layout=torch.strided
  ),
  jagged_tensor= torch.nested.as_nested_tensor(
      [torch.randn(4, 5), torch.randn(4, 54)], layout=torch.jagged
  ),

Theoretically, these nested tensors should be splited into single tensors during making values. But in practice, I recommend add more check to make sure there is not nested tensor trying to use memory view.

tensor.is_nested and tensor.layout == torch.strided

tianyi-ge · 2025-12-22T01:43:30Z

After discussion, we agree that serial.utils.serialization can extract tensor and non-tensor, but for one key yuanrong can only handle one memoryview/bytes/bytearray/str. In other words, yuanrong mset interface is not as flexible as vanilla storage unit implementation. In fact, it's possible taht other backend has the same semantics as yuanrong. I'm afraid they can't utilize serial.utils.serialization either. So here we can accept a small zcopy helper inside yuarnong impl.

The test result shows that zcopy brings 5~20% mitigation of total latency of put, and 2x get throughput compared to clone each view tensor.

dpj135 · 2025-12-22T09:04:17Z

+    if isinstance(obj, torch.Tensor) and obj.is_contiguous():
+        return memoryview(obj.numpy())
+    try:
+        return memoryview(obj)


It seems that memoryview(obj.numpy()) can also be used for non-contiguous tensors. Is its performance better than pickle.dump()?

It might be dangerous. We observe vLLM zero-copy implementation make it continues before transport: https://github.com/vllm-project/vllm/blob/main/vllm/v1/serial_utils.py#L227

dpj135 · 2025-12-22T09:07:45Z

                else:
                    cpu_keys.append(key)
-                    cpu_values.append(pickle.dumps(value))
+                    cpu_values.append(try_zero_copy_serialize(value))


Need to modify _batch_get() synchronously.

This might require a method to determine whether the data is processed via memoryviewor pickle.

you are right. the metadata of each value is required to deserialize various data types. maybe we can add a metadata head field in front of each value, and write new ser/des functions to deal with all kv storages (yuanrong, mooncake, etc.)

for example,

def kv_storage_serialize(obj) -> memoryview | bytes | bytearray | str: # determine which serialization method is the best # prepend metadata and deserialization method id in front of main data def kv_storage_deserialize(values: bytes): # parse metadata and get deserialization method id # choose the corresponding deserializer # remove the metadata head and apply on the main data

one concern is that if we prepend metadata (with a certain length), it may worsen the zero-copy performance.

dpj135 · 2025-12-22T09:08:24Z

        else:
            #  All data goes through CPU path
-            pickled_values = [pickle.dumps(v) for v in values]
+            pickled_values = [try_zero_copy_serialize(v) for v in values]


Same as above.

0oshowero0 · 2025-12-22T12:21:46Z

btw, where is the corresponding deserialization for memory view zero copy?

tianyi-ge · 2025-12-22T12:39:33Z

btw, where is the corresponding deserialization for memory view zero copy?

you are right, I'm redesigning deserialization to make it work for both pickled bytes and memoryview. it's expected be reused by other kv backends

tianyi-ge · 2025-12-25T09:41:44Z

I'm afraid it's impossible to achieve the following targets at the same time

use current set(keys, valus) where vals are memoryview or bytes or bytearray or str, and get(keys) -> list[bytes]
support custom zero-copy ser/des for certain types

The proof: if we want to support custom ser/des, we need metadata and type id to decide which deserializer to use on a bytes. These extra information can be accessed by either 1. concat with real value; 2. store in another kv pair. Neither of them is an ideal choice.

concat still causes copy. e.g. np.ndarray -> concat with metadata as one big bytes -> set
new kv pair doubles the kv management pressure of control plane

what we need is something like np.ndarray --(directly write into)--> rdma memory region, only one copy. It's possible if set can provide a buffer from registered rdma memory region for users to write their array and metadata into. It's a feasible way since yuanrong already supports this in c++ but not in python.

zmq can support this because of send_multipart. It transfers multiple contiguous memory at the same time. The receiver can parse some of them as metadata and some of them as real data, so that custom deserializer is possible. This is another way to tackle: set supports one key -> multiple contiguous memory (e.g. metadata and real data memoryview).

In summary, two feasible ways to achieve custom zero-copy deserializer with kv system:

kv system supports create_buffer + set semantics: directly write to memory region, only one copy

metadata = pickle.dumps(prepare_metadata(array))
size = len(metadata) + len(array)
buffers = mcreate([key], [size])
buffer = buffers[0]
buffer[:len(metadata)] = metadata
buffer[len(metadata)] = memoryview(array)
kv_client.mset(buffers)

kv system supports key : list[val] to avoid concat on user side

metadata = pickle.dumps(prepare_metadata(array))
kv_client.mset([key], [[metadata, memoryview(array)]])

dpj135 · 2025-12-25T11:26:07Z

I'm afraid it's impossible to achieve the following targets at the same time

use current set(keys, valus) where vals are memoryview or bytes or bytearray or str, and get(keys) -> list[bytes]

support custom zero-copy ser/des for certain types

The proof: if we want to support custom ser/des, we need metadata and type id to decide which deserializer to use on a bytes. These extra information can be accessed by either 1. concat with real value; 2. store in another kv pair. Neither of them is an ideal choice.

concat still causes copy. e.g. np.ndarray -> concat with metadata as one big bytes -> set

new kv pair doubles the kv management pressure of control plane

what we need is something like np.ndarray --(directly write into)--> rdma memory region, only one copy. It's possible if set can provide a buffer from registered rdma memory region for users to write their array and metadata into. It's a feasible way since yuanrong already supports this in c++ but not in python.

zmq can support this because of send_multipart. It transfers multiple contiguous memory at the same time. The receiver can parse some of them as metadata and some of them as real data, so that custom deserializer is possible. This is another way to tackle: set` supports one key -> multiple contiguous memory (e.g. metadata and real data memoryview).

In summary, two feasible ways to achieve custom zero-copy deserializer with kv system:

kv system supports create_buffer + set semantics: directly write to memory region, only one copy

kv system supports key : list[val] to avoid concat on user side.

Could we transfer metadata into tensor？And then add metadata to tensordata using view or others if memoryview support non-contiguous tensor.

tianyi-ge · 2025-12-25T11:41:49Z

Could we transfer metadata into tensor？And then add metadata to tensordata using view or others if memoryview support non-contiguous tensor.

so far I can't think of a good idea, becausememoryview can only support non-contiguous memory segments that are originally from one big large memory segment (e.g. memoryview(data[::2])), but memoryview(array, metadata) will fail.

0oshowero0 · 2025-12-25T11:49:31Z

kv system supports create_buffer + set semantics: directly write to memory region, only one copy

For this approach, how can the receiver deserialize the data? (how can we know the size of metadata part)?

0oshowero0 · 2025-12-25T12:01:30Z

I'm not sure what is the drawback of this implementation: we store the metadata and give it a dedicated key (by appending a suffix like _meta/_data to the original key), treating it as a separate object alongside the raw data.

This will essentially double the total key-value pairs in the system, and we need to ensure atomicity between the _meta and _data keys—if only one of them is successfully operated (e.g., data is updated but metadata fails to write), it will result in invalid metadata-data mappings, leading to deserialization failures or data corruption.

Is that acceptable?

metadata = pickle.dumps(prepare_metadata(array))
kv_client.put(keys=['key_meta', key_data], values=[metadata, array])

tianyi-ge · 2025-12-25T14:29:26Z

For this approach, how can the receiver deserialize the data? (how can we know the size of metadata part)?

we can either set the length of metadata as a constant, or prepend the length of metadata at the beginning. for example, metadata_length (uint32)

mset([key], [[metadata_length, metadata, array]])

a little dirty :)

tianyi-ge · 2025-12-25T14:34:29Z

Ya I'll test if meta suffix significantly affects the latency/throughput of yuanrong

tianyi-ge · 2025-12-29T12:39:47Z

I wrote a sample use case of one-copy set/get on CPU side. The following code will be written in tq and hopefully can be merged to yuanrong

def mset_zcopy(keys: list[str], objects: list[Any]):
	# prepare metadata and memoryview
	for obj in objects:
		md, mv = serialize(obj) # if any, pickle
		serialized_metas.append(pickle.dumps(md))
		memoryviews.append(mv)
		metalen = len(serialized_metas[-1]) # HEADER_SIZE = 4 bytes
		sizes.append(HEADER_SIZE + metalen + len(mv))
		# | HEADER | SERIALIZED_META | MEMORYVIEW |

	# copy metadata and memoryviews to rdma buffer
	buffers = kv_client.mcreate(keys, sizes)
	for buffer, md, mv in zip(buffers, serialized_metas, memoryviews):
		fill_header(buffer, len(md))
		buffer[HEADER_SIZE : HEADER_SIZE + len(HEADER_SIZE)] = md
		cursor = HEADER_SIZE + len(HEADER_SIZE)
		buffer[cursor : cursor + len(mv)] = mv # one copy

	kv_client.mset(buffers)

def mget_zcopy(keys: list[str]) -> list[Any]:
	values = kv_client.get(keys)
	for val in values:
		metalen = parse_header(val)
		md = pickle.loads(val[HEADER_SIZE : HEADER_SIZE + metalen])
		mv = memoryview(val[HEADER_SIZE + metalen :])
		objects.append(deserialize(md, mv))
	return objects

where serialize can utilize pickle.dumps(protocol=pickle.HIGHEST_PROTOCOL)

def serialize(obj):
	mvs = []
	meta = pickle.dumps(obj, buffer_callback=mvs.append, protocol=pickle.HIGHEST_PROTOCOL)
	assert len(mvs) == 1
	return meta, memoryview(mv[0])

For simplicity, I assume only one contiguous memoryview. To support objects with multiple contiguous chunks, some logic needs to be modified.

dpj135 reviewed Dec 22, 2025

View reviewed changes

dpj135 self-requested a review December 22, 2025 09:09

tianyi-ge marked this pull request as draft December 22, 2025 11:37

add zero copy prototype if mcreate is ready

30c0563

tianyi-ge force-pushed the gty/zcopy branch from 3d2093c to 30c0563 Compare December 30, 2025 16:16

dpj135 mentioned this pull request Jan 8, 2026

[Fix] Fix YuanrongStorageClient #170

Closed

Evelynn-V mentioned this pull request Jan 16, 2026

[Feat] Introduce Zero-Copy to use YuanrongStorageClient for transmitting CPU Tensors #171

Open

Evelynn-V mentioned this pull request Jan 26, 2026

[fix] Fix UT of YuanrongStorageClient Ascend/TransferQueue#12

Merged

Conversation

tianyi-ge commented Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Background

Solution

Notice

Uh oh!

coderabbitai Bot commented Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

0oshowero0 commented Dec 19, 2025

Uh oh!

tianyi-ge commented Dec 19, 2025

Uh oh!

0oshowero0 commented Dec 19, 2025

Uh oh!

0oshowero0 commented Dec 19, 2025

Uh oh!

tianyi-ge commented Dec 22, 2025

Uh oh!

dpj135 Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

0oshowero0 Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

dpj135 Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

dpj135 Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

tianyi-ge Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

dpj135 Dec 22, 2025

Choose a reason for hiding this comment

Uh oh!

0oshowero0 commented Dec 22, 2025

Uh oh!

tianyi-ge commented Dec 22, 2025

Uh oh!

tianyi-ge commented Dec 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dpj135 commented Dec 25, 2025

Uh oh!

tianyi-ge commented Dec 25, 2025

Uh oh!

0oshowero0 commented Dec 25, 2025

Uh oh!

0oshowero0 commented Dec 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tianyi-ge commented Dec 25, 2025

Uh oh!

tianyi-ge commented Dec 25, 2025

Uh oh!

tianyi-ge commented Dec 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tianyi-ge commented Dec 19, 2025 •

edited

Loading

coderabbitai Bot commented Dec 19, 2025 •

edited

Loading

tianyi-ge commented Dec 25, 2025 •

edited

Loading

0oshowero0 commented Dec 25, 2025 •

edited

Loading

tianyi-ge commented Dec 29, 2025 •

edited

Loading