Skip to content
This repository was archived by the owner on Jan 21, 2026. It is now read-only.

[Feat]: Try zero-copy serialize objects that can be converted to memoryview#147

Draft
tianyi-ge wants to merge 1 commit into
devfrom
gty/zcopy
Draft

[Feat]: Try zero-copy serialize objects that can be converted to memoryview#147
tianyi-ge wants to merge 1 commit into
devfrom
gty/zcopy

Conversation

@tianyi-ge

@tianyi-ge tianyi-ge commented Dec 19, 2025

Copy link
Copy Markdown

Background

For yuanrong-datasystem kv_client, it provides mset that supports cpu contiguous objects e.g. memoryview, bytes, and bytearray. The current solution is pickle.dumps everything on cpu, which in fact unnecessarily allocates and calculates serialization for contiguous objects e.g. tensor, view tensor, numpy array, etc.

Solution

This PR try to recognize common data types that can be passed to mset without copy. One limitation here is that user tensor/view tensor cannot be modified during transfer.

Notice

  1. The object passed to try_zero_copy_serialize has to be read-only and kept alive as long as the key not deleted, or in other words potential transfers may happen.
  2. Here I did not utilize serialization in serial_utils because it's suitable for a mixed tensor dict. For yuanrong kvclient.mset, one key-value pair cannot handle a flattened tensor list. A value can either be memoryview-ed or pickle.dumps-ed as a whole.

@coderabbitai

coderabbitai Bot commented Dec 19, 2025

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Note

🎁 Summarized by CodeRabbit Free

Your organization is on the Free plan. CodeRabbit will generate a high-level summary and a walkthrough for each pull request. For a comprehensive line-by-line review, please upgrade your subscription to CodeRabbit Pro by visiting https://app.coderabbit.ai/login.

Comment @coderabbitai help to get the list of available commands and usage tips.

@0oshowero0

Copy link
Copy Markdown
Member

There is a pubic utils for zero-copy serialization in #140 , maybe it can be directly used here so that all the backends can share the same set of serial/deserial utils to improve maintainability.

@tianyi-ge

Copy link
Copy Markdown
Author

There is a pubic utils for zero-copy serialization in #140 , maybe it can be directly used here so that all the backends can share the same set of serial/deserial utils to improve maintainability.

Yep I noticed this utility. As I mentioned in the PR description, yuanrong kv client expects a memoryview as an argument. Is it possible to extract the memoryview from the return value of serialization?

return [pickled_bytes, pickle.dumps(nested_tensor_info), *serialized_tensors]

@0oshowero0

Copy link
Copy Markdown
Member

I see. There is a little bit tricky since both MsgpackEncoder and the general serialization cannot directly return only the memory view obj. If yuanrong do not need to tackle tensordict as a whole or nested tensor (which should be already unbinded to tensors), I think it's ok to use memoryview(obj.numpy()) directly.

@0oshowero0

Copy link
Copy Markdown
Member

Need to make sure all these scenarios are considered:

  nested_tensor = torch.nested.as_nested_tensor(
      [torch.randn(4, 3), torch.randn(2, 4)], layout=torch.strided
  ),
  jagged_tensor= torch.nested.as_nested_tensor(
      [torch.randn(4, 5), torch.randn(4, 54)], layout=torch.jagged
  ),

Theoretically, these nested tensors should be splited into single tensors during making values. But in practice, I recommend add more check to make sure there is not nested tensor trying to use memory view.

tensor.is_nested and tensor.layout == torch.strided

@tianyi-ge

Copy link
Copy Markdown
Author

After discussion, we agree that serial.utils.serialization can extract tensor and non-tensor, but for one key yuanrong can only handle one memoryview/bytes/bytearray/str. In other words, yuanrong mset interface is not as flexible as vanilla storage unit implementation. In fact, it's possible taht other backend has the same semantics as yuanrong. I'm afraid they can't utilize serial.utils.serialization either. So here we can accept a small zcopy helper inside yuarnong impl.

The test result shows that zcopy brings 5~20% mitigation of total latency of put, and 2x get throughput compared to clone each view tensor.

Comment on lines +42 to +45
if isinstance(obj, torch.Tensor) and obj.is_contiguous():
return memoryview(obj.numpy())
try:
return memoryview(obj)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that memoryview(obj.numpy()) can also be used for non-contiguous tensors. Is its performance better than pickle.dump()?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be dangerous. We observe vLLM zero-copy implementation make it continues before transport: https://github.com/vllm-project/vllm/blob/main/vllm/v1/serial_utils.py#L227

else:
cpu_keys.append(key)
cpu_values.append(pickle.dumps(value))
cpu_values.append(try_zero_copy_serialize(value))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to modify _batch_get() synchronously.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might require a method to determine whether the data is processed via memoryviewor pickle.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right. the metadata of each value is required to deserialize various data types. maybe we can add a metadata head field in front of each value, and write new ser/des functions to deal with all kv storages (yuanrong, mooncake, etc.)

for example,

def kv_storage_serialize(obj) -> memoryview | bytes | bytearray | str:
    # determine which serialization method is the best
    # prepend metadata and deserialization method id in front of main data

def kv_storage_deserialize(values: bytes):
    # parse metadata and get deserialization method id
    # choose the corresponding deserializer
    # remove the metadata head and apply on the main data

one concern is that if we prepend metadata (with a certain length), it may worsen the zero-copy performance.

else:
# All data goes through CPU path
pickled_values = [pickle.dumps(v) for v in values]
pickled_values = [try_zero_copy_serialize(v) for v in values]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@dpj135 dpj135 self-requested a review December 22, 2025 09:09
@tianyi-ge tianyi-ge marked this pull request as draft December 22, 2025 11:37
@0oshowero0

Copy link
Copy Markdown
Member

btw, where is the corresponding deserialization for memory view zero copy?

@tianyi-ge

Copy link
Copy Markdown
Author

btw, where is the corresponding deserialization for memory view zero copy?

you are right, I'm redesigning deserialization to make it work for both pickled bytes and memoryview. it's expected be reused by other kv backends

@tianyi-ge

tianyi-ge commented Dec 25, 2025

Copy link
Copy Markdown
Author

I'm afraid it's impossible to achieve the following targets at the same time

  1. use current set(keys, valus) where vals are memoryview or bytes or bytearray or str, and get(keys) -> list[bytes]
  2. support custom zero-copy ser/des for certain types

The proof: if we want to support custom ser/des, we need metadata and type id to decide which deserializer to use on a bytes. These extra information can be accessed by either 1. concat with real value; 2. store in another kv pair. Neither of them is an ideal choice.

  1. concat still causes copy. e.g. np.ndarray -> concat with metadata as one big bytes -> set
  2. new kv pair doubles the kv management pressure of control plane

what we need is something like np.ndarray --(directly write into)--> rdma memory region, only one copy. It's possible if set can provide a buffer from registered rdma memory region for users to write their array and metadata into. It's a feasible way since yuanrong already supports this in c++ but not in python.

zmq can support this because of send_multipart. It transfers multiple contiguous memory at the same time. The receiver can parse some of them as metadata and some of them as real data, so that custom deserializer is possible. This is another way to tackle: set supports one key -> multiple contiguous memory (e.g. metadata and real data memoryview).

In summary, two feasible ways to achieve custom zero-copy deserializer with kv system:

  1. kv system supports create_buffer + set semantics: directly write to memory region, only one copy
metadata = pickle.dumps(prepare_metadata(array))
size = len(metadata) + len(array)
buffers = mcreate([key], [size])
buffer = buffers[0]
buffer[:len(metadata)] = metadata
buffer[len(metadata)] = memoryview(array)
kv_client.mset(buffers)
  1. kv system supports key : list[val] to avoid concat on user side
metadata = pickle.dumps(prepare_metadata(array))
kv_client.mset([key], [[metadata, memoryview(array)]])
image image image

@dpj135

dpj135 commented Dec 25, 2025

Copy link
Copy Markdown
Contributor

I'm afraid it's impossible to achieve the following targets at the same time

  1. use current set(keys, valus) where vals are memoryview or bytes or bytearray or str, and get(keys) -> list[bytes]
  2. support custom zero-copy ser/des for certain types

The proof: if we want to support custom ser/des, we need metadata and type id to decide which deserializer to use on a bytes. These extra information can be accessed by either 1. concat with real value; 2. store in another kv pair. Neither of them is an ideal choice.

  1. concat still causes copy. e.g. np.ndarray -> concat with metadata as one big bytes -> set
  2. new kv pair doubles the kv management pressure of control plane

what we need is something like np.ndarray --(directly write into)--> rdma memory region, only one copy. It's possible if set can provide a buffer from registered rdma memory region for users to write their array and metadata into. It's a feasible way since yuanrong already supports this in c++ but not in python.

zmq can support this because of send_multipart. It transfers multiple contiguous memory at the same time. The receiver can parse some of them as metadata and some of them as real data, so that custom deserializer is possible. This is another way to tackle: set` supports one key -> multiple contiguous memory (e.g. metadata and real data memoryview).

In summary, two feasible ways to achieve custom zero-copy deserializer with kv system:

  1. kv system supports create_buffer + set semantics: directly write to memory region, only one copy
  2. kv system supports key : list[val] to avoid concat on user side.

image image image

Could we transfer metadata into tensor?And then add metadata to tensordata using view or others if memoryview support non-contiguous tensor.

@tianyi-ge

Copy link
Copy Markdown
Author

Could we transfer metadata into tensor?And then add metadata to tensordata using view or others if memoryview support non-contiguous tensor.

so far I can't think of a good idea, becausememoryview can only support non-contiguous memory segments that are originally from one big large memory segment (e.g. memoryview(data[::2])), but memoryview(array, metadata) will fail.

@0oshowero0

Copy link
Copy Markdown
Member
  1. kv system supports create_buffer + set semantics: directly write to memory region, only one copy

For this approach, how can the receiver deserialize the data? (how can we know the size of metadata part)?

@0oshowero0

0oshowero0 commented Dec 25, 2025

Copy link
Copy Markdown
Member

I'm not sure what is the drawback of this implementation: we store the metadata and give it a dedicated key (by appending a suffix like _meta/_data to the original key), treating it as a separate object alongside the raw data.

This will essentially double the total key-value pairs in the system, and we need to ensure atomicity between the _meta and _data keys—if only one of them is successfully operated (e.g., data is updated but metadata fails to write), it will result in invalid metadata-data mappings, leading to deserialization failures or data corruption.

Is that acceptable?

metadata = pickle.dumps(prepare_metadata(array))
kv_client.put(keys=['key_meta', key_data], values=[metadata, array])

@tianyi-ge

Copy link
Copy Markdown
Author

For this approach, how can the receiver deserialize the data? (how can we know the size of metadata part)?

we can either set the length of metadata as a constant, or prepend the length of metadata at the beginning. for example, metadata_length (uint32)

mset([key], [[metadata_length, metadata, array]])

a little dirty :)

@tianyi-ge

Copy link
Copy Markdown
Author

Ya I'll test if meta suffix significantly affects the latency/throughput of yuanrong

@tianyi-ge

tianyi-ge commented Dec 29, 2025

Copy link
Copy Markdown
Author

I wrote a sample use case of one-copy set/get on CPU side. The following code will be written in tq and hopefully can be merged to yuanrong

def mset_zcopy(keys: list[str], objects: list[Any]):
	# prepare metadata and memoryview
	for obj in objects:
		md, mv = serialize(obj) # if any, pickle
		serialized_metas.append(pickle.dumps(md))
		memoryviews.append(mv)
		metalen = len(serialized_metas[-1]) # HEADER_SIZE = 4 bytes
		sizes.append(HEADER_SIZE + metalen + len(mv))
		# | HEADER | SERIALIZED_META | MEMORYVIEW |

	# copy metadata and memoryviews to rdma buffer
	buffers = kv_client.mcreate(keys, sizes)
	for buffer, md, mv in zip(buffers, serialized_metas, memoryviews):
		fill_header(buffer, len(md))
		buffer[HEADER_SIZE : HEADER_SIZE + len(HEADER_SIZE)] = md
		cursor = HEADER_SIZE + len(HEADER_SIZE)
		buffer[cursor : cursor + len(mv)] = mv # one copy

	kv_client.mset(buffers)

def mget_zcopy(keys: list[str]) -> list[Any]:
	values = kv_client.get(keys)
	for val in values:
		metalen = parse_header(val)
		md = pickle.loads(val[HEADER_SIZE : HEADER_SIZE + metalen])
		mv = memoryview(val[HEADER_SIZE + metalen :])
		objects.append(deserialize(md, mv))
	return objects

where serialize can utilize pickle.dumps(protocol=pickle.HIGHEST_PROTOCOL)

def serialize(obj):
	mvs = []
	meta = pickle.dumps(obj, buffer_callback=mvs.append, protocol=pickle.HIGHEST_PROTOCOL)
	assert len(mvs) == 1
	return meta, memoryview(mv[0])

For simplicity, I assume only one contiguous memoryview. To support objects with multiple contiguous chunks, some logic needs to be modified.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants