Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Note 🎁 Summarized by CodeRabbit FreeYour organization is on the Free plan. CodeRabbit will generate a high-level summary and a walkthrough for each pull request. For a comprehensive line-by-line review, please upgrade your subscription to CodeRabbit Pro by visiting https://app.coderabbit.ai/login. Comment |
|
There is a pubic utils for zero-copy serialization in #140 , maybe it can be directly used here so that all the backends can share the same set of serial/deserial utils to improve maintainability. |
Yep I noticed this utility. As I mentioned in the PR description, yuanrong kv client expects a memoryview as an argument. Is it possible to extract the memoryview from the return value of |
|
I see. There is a little bit tricky since both |
|
Need to make sure all these scenarios are considered: nested_tensor = torch.nested.as_nested_tensor(
[torch.randn(4, 3), torch.randn(2, 4)], layout=torch.strided
),
jagged_tensor= torch.nested.as_nested_tensor(
[torch.randn(4, 5), torch.randn(4, 54)], layout=torch.jagged
),Theoretically, these nested tensors should be splited into single tensors during making values. But in practice, I recommend add more check to make sure there is not nested tensor trying to use memory view.
|
|
After discussion, we agree that The test result shows that zcopy brings 5~20% mitigation of total latency of put, and 2x get throughput compared to |
| if isinstance(obj, torch.Tensor) and obj.is_contiguous(): | ||
| return memoryview(obj.numpy()) | ||
| try: | ||
| return memoryview(obj) |
There was a problem hiding this comment.
It seems that memoryview(obj.numpy()) can also be used for non-contiguous tensors. Is its performance better than pickle.dump()?
There was a problem hiding this comment.
It might be dangerous. We observe vLLM zero-copy implementation make it continues before transport: https://github.com/vllm-project/vllm/blob/main/vllm/v1/serial_utils.py#L227
| else: | ||
| cpu_keys.append(key) | ||
| cpu_values.append(pickle.dumps(value)) | ||
| cpu_values.append(try_zero_copy_serialize(value)) |
There was a problem hiding this comment.
Need to modify _batch_get() synchronously.
There was a problem hiding this comment.
This might require a method to determine whether the data is processed via memoryviewor pickle.
There was a problem hiding this comment.
you are right. the metadata of each value is required to deserialize various data types. maybe we can add a metadata head field in front of each value, and write new ser/des functions to deal with all kv storages (yuanrong, mooncake, etc.)
for example,
def kv_storage_serialize(obj) -> memoryview | bytes | bytearray | str:
# determine which serialization method is the best
# prepend metadata and deserialization method id in front of main data
def kv_storage_deserialize(values: bytes):
# parse metadata and get deserialization method id
# choose the corresponding deserializer
# remove the metadata head and apply on the main data
one concern is that if we prepend metadata (with a certain length), it may worsen the zero-copy performance.
| else: | ||
| # All data goes through CPU path | ||
| pickled_values = [pickle.dumps(v) for v in values] | ||
| pickled_values = [try_zero_copy_serialize(v) for v in values] |
|
btw, where is the corresponding deserialization for memory view zero copy? |
you are right, I'm redesigning deserialization to make it work for both pickled bytes and |
Could we transfer metadata into tensor?And then add metadata to tensordata using |
so far I can't think of a good idea, because |
For this approach, how can the receiver deserialize the data? (how can we know the size of metadata part)? |
|
I'm not sure what is the drawback of this implementation: we store the metadata and give it a dedicated key (by appending a suffix like This will essentially double the total key-value pairs in the system, and we need to ensure atomicity between the Is that acceptable? metadata = pickle.dumps(prepare_metadata(array))
kv_client.put(keys=['key_meta', key_data], values=[metadata, array]) |
we can either set the length of metadata as a constant, or prepend the length of metadata at the beginning. for example, metadata_length (uint32) mset([key], [[metadata_length, metadata, array]])a little dirty :) |
|
Ya I'll test if meta suffix significantly affects the latency/throughput of yuanrong |
|
I wrote a sample use case of one-copy set/get on CPU side. The following code will be written in tq and hopefully can be merged to yuanrong def mset_zcopy(keys: list[str], objects: list[Any]):
# prepare metadata and memoryview
for obj in objects:
md, mv = serialize(obj) # if any, pickle
serialized_metas.append(pickle.dumps(md))
memoryviews.append(mv)
metalen = len(serialized_metas[-1]) # HEADER_SIZE = 4 bytes
sizes.append(HEADER_SIZE + metalen + len(mv))
# | HEADER | SERIALIZED_META | MEMORYVIEW |
# copy metadata and memoryviews to rdma buffer
buffers = kv_client.mcreate(keys, sizes)
for buffer, md, mv in zip(buffers, serialized_metas, memoryviews):
fill_header(buffer, len(md))
buffer[HEADER_SIZE : HEADER_SIZE + len(HEADER_SIZE)] = md
cursor = HEADER_SIZE + len(HEADER_SIZE)
buffer[cursor : cursor + len(mv)] = mv # one copy
kv_client.mset(buffers)
def mget_zcopy(keys: list[str]) -> list[Any]:
values = kv_client.get(keys)
for val in values:
metalen = parse_header(val)
md = pickle.loads(val[HEADER_SIZE : HEADER_SIZE + metalen])
mv = memoryview(val[HEADER_SIZE + metalen :])
objects.append(deserialize(md, mv))
return objectswhere def serialize(obj):
mvs = []
meta = pickle.dumps(obj, buffer_callback=mvs.append, protocol=pickle.HIGHEST_PROTOCOL)
assert len(mvs) == 1
return meta, memoryview(mv[0])For simplicity, I assume only one contiguous memoryview. To support objects with multiple contiguous chunks, some logic needs to be modified. |






Background
For yuanrong-datasystem
kv_client, it providesmsetthat supports cpu contiguous objects e.g. memoryview, bytes, and bytearray. The current solution ispickle.dumpseverything on cpu, which in fact unnecessarily allocates and calculates serialization for contiguous objects e.g. tensor, view tensor, numpy array, etc.Solution
This PR try to recognize common data types that can be passed to
msetwithout copy. One limitation here is that user tensor/view tensor cannot be modified during transfer.Notice
try_zero_copy_serializehas to be read-only and kept alive as long as the key not deleted, or in other words potential transfers may happen.serial_utilsbecause it's suitable for a mixed tensor dict. For yuanrongkvclient.mset, one key-value pair cannot handle a flattened tensor list. A value can either bememoryview-ed orpickle.dumps-ed as a whole.