Replica local seq_id can diverge during snapshot catchup and affect tied sort pagination
Bug Description
When all user-supplied sort fields tie, Typesense falls back to the internal seq_id as the final tiebreaker. Each replica assigns that seq_id locally when it applies a write. The serialized Raft log entry carries the document payload, but not a leader-assigned seq_id.
As a result, a follower can end up with a different seq_id for the same external document id. If its local store is missing that id when it applies an upsert, it takes the create path and allocates a new seq_id, while the other replicas apply the same upsert as an update and reuse the existing seq_id.
The first script below demonstrates the origin path deterministically: node 3 loads a snapshot that is missing one document, then catches up an upsert for that same document from Raft. Nodes 1 and 2 reuse the old seq_id; node 3 allocates a fresh one during catchup. The second script demonstrates the visible consequence once the drift exists: same user fields on all nodes, but different tied sort order on node 3.
Both reproducers are fully local and synthetic. They use Docker, curl, jq, and a temporary local rocksdb-tools container. They do not connect to any remote cluster.
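The create versus update split described above can be sketched with a toy model. This is not Typesense code: the two associative arrays, the counters, and the apply_upsert function are hypothetical stand-ins for each replica's local doc id to seq_id mapping and its next seq_id counter. The sketch assumes bash 4.3 or newer for namerefs.

```bash
#!/usr/bin/env bash
# Toy model only, NOT Typesense code. All names here are hypothetical.
set -euo pipefail

declare -A STORE_A=() STORE_B=()   # per-replica doc id -> seq_id maps
NEXT_A=0
NEXT_B=0                           # per-replica next seq_id counters

# apply_upsert <store-var> <counter-var> <doc-id>
# Update branch: the id is known locally, so the existing seq_id is kept.
# Create branch: the id is unknown locally, so the next local seq_id is used.
apply_upsert() {
  local -n store=$1 counter=$2
  local id=$3
  if [[ -n "${store[$id]+set}" ]]; then
    return 0
  fi
  store[$id]=$counter
  counter=$((counter + 1))
}

# Both replicas apply the same three creates and agree on every seq_id.
for n in 1 2 3; do
  apply_upsert STORE_A NEXT_A "doc-${n}"
  apply_upsert STORE_B NEXT_B "doc-${n}"
done

# Replica B now behaves as if it restored a snapshot that is missing doc-2.
unset 'STORE_B[doc-2]'

# The same committed upsert is applied on both replicas.
apply_upsert STORE_A NEXT_A "doc-2"
apply_upsert STORE_B NEXT_B "doc-2"

echo "replica A doc-2 seq_id=${STORE_A[doc-2]}"
echo "replica B doc-2 seq_id=${STORE_B[doc-2]}"
```

In this model replica A keeps seq_id 1 for doc-2 while replica B allocates seq_id 3, even though both applied an identical sequence of writes. That is the shape of the divergence Reproducer A creates on node 3.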
Reproduction Steps
Reproducer A: create seq_id drift during snapshot catchup
Save as reproduce_snapshot_catchup_seq_id_drift.sh, then run:
bash reproduce_snapshot_catchup_seq_id_drift.sh
Expected ending:
BUG REPRODUCED: seq_id drift was created during snapshot load plus Raft catchup.
Same user fields, same committed upsert, different hidden seq_id tiebreaker.
Reproducer A full script
#!/bin/bash
# Issue: seq_id drift is created during follower snapshot catchup
# Typesense Version: 30.2
# Description:
# Reproduces the origin path for seq-id-repro locally. Node 3 is restarted from a
# local Raft snapshot that is missing one document, while an upsert for that
# same document is committed on nodes 1 and 2 while node 3 is offline. When
# node 3 starts, it loads the snapshot and catches up the committed upsert from
# Raft. Nodes 1 and 2 apply that upsert as an update and reuse the old seq_id.
# Node 3 applies the same log entry against a store where the id is missing,
# takes the create branch, allocates a fresh seq_id, and then returns a
# different order for tied sorts.
set -euo pipefail
# ============================================================================
# CONFIGURATION
# ============================================================================
TYPESENSE_API_KEY=xyz
VERSION=${VERSION:-30.2}
COLLECTION=screens
CONTAINER_NAME=typesense-issue-seq-id-repro-seq-id-drift-during-snapshot-catchup
NET=typesense-seqid-origin-net
SUBNET=10.245.0.0/24
NODES_CONF="10.245.0.11:8107:8108,10.245.0.12:8107:8108,10.245.0.13:8107:8108"
DOC_COUNT=80
TARGET_ID=doc-010
TARGET_TITLE="screen 010 catchup"
EXPECTED_TARGET_SEQ_ID=9
COLLECTION_ID=0
ROCKSDB_TOOLS_IMAGE=typesense-local-rocksdb-tools:bookworm
IP=("_" 10.245.0.11 10.245.0.12 10.245.0.13)
HOSTPORT=(0 8418 8428 8438)
NAME=("_" ts-seqid-origin-n1 ts-seqid-origin-n2 ts-seqid-origin-n3)
WORKDIR=$(pwd)
DOCKER_NOCONV=""
case "$(uname -s)" in
MINGW*|MSYS*|CYGWIN*)
DOCKER_NOCONV="env MSYS_NO_PATHCONV=1"
WORKDIR=$(pwd -W)
;;
esac
DATA_ROOT="${WORKDIR}/typesense-data-${CONTAINER_NAME}"
DOCS_FILE="${WORKDIR}/docs-origin.jsonl"
RESULT_DIR="${WORKDIR}/results-origin"
# ============================================================================
# CLEANUP FUNCTION
# ============================================================================
cleanup() {
echo ""
echo "=== Cleanup ==="
for i in 1 2 3; do
docker rm -f "${NAME[$i]}" >/dev/null 2>&1 || true
done
docker network rm "${NET}" >/dev/null 2>&1 || true
rm -rf "${DATA_ROOT}" "${DOCS_FILE}" "${RESULT_DIR}"
echo "Cleanup complete"
}
trap cleanup EXIT
require_cmd() {
command -v "$1" >/dev/null 2>&1 || {
echo "ERROR: required command not found: $1"
exit 127
}
}
# ============================================================================
# HELPERS
# ============================================================================
host_for_node() {
local node=$1
printf 'http://localhost:%s' "${HOSTPORT[$node]}"
}
api() {
local node=$1 method=$2 path=$3
shift 3
curl -sS -X "${method}" "$(host_for_node "${node}")${path}" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" "$@"
}
wait_for_health() {
local nodes_csv=$1 max_wait=${2:-90}
local waited=0
while [ "${waited}" -lt "${max_wait}" ]; do
local ok=0 total=0
IFS=',' read -ra nodes <<< "${nodes_csv}"
for node in "${nodes[@]}"; do
total=$((total + 1))
code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 2 \
"$(host_for_node "${node}")/health" 2>/dev/null || true)
[ "${code}" = "200" ] && ok=$((ok + 1))
done
[ "${ok}" -eq "${total}" ] && return 0
sleep 1
waited=$((waited + 1))
done
echo "ERROR: nodes ${nodes_csv} did not become healthy"
for i in 1 2 3; do
echo "--- ${NAME[$i]} logs ---"
docker logs --tail 40 "${NAME[$i]}" 2>&1 || true
done
exit 2
}
wait_for_collection() {
local collection=$1 max_wait=${2:-60}
local waited=0
while [ "${waited}" -lt "${max_wait}" ]; do
local ready=0
for node in 1 2 3; do
code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 2 \
"$(host_for_node "${node}")/collections/${collection}" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" 2>/dev/null || true)
[ "${code}" = "200" ] && ready=$((ready + 1))
done
[ "${ready}" -eq 3 ] && return 0
sleep 1
waited=$((waited + 1))
done
echo "ERROR: collection ${collection} was not visible on all nodes"
exit 2
}
collection_count() {
local node=$1
api "${node}" GET "/collections/${COLLECTION}" | jq -r '.num_documents // -1'
}
wait_for_count() {
local nodes_csv=$1 expected=$2 max_wait=${3:-90}
local waited=0
while [ "${waited}" -lt "${max_wait}" ]; do
local ok=0 total=0
IFS=',' read -ra nodes <<< "${nodes_csv}"
for node in "${nodes[@]}"; do
total=$((total + 1))
count=$(collection_count "${node}" 2>/dev/null || echo -1)
[ "${count}" = "${expected}" ] && ok=$((ok + 1))
done
[ "${ok}" -eq "${total}" ] && return 0
sleep 1
waited=$((waited + 1))
done
echo "ERROR: expected count ${expected} on nodes ${nodes_csv}"
for node in 1 2 3; do
echo " node ${node}: $(collection_count "${node}" 2>/dev/null || echo unavailable)"
done
exit 2
}
doc_title() {
local node=$1
api "${node}" GET "/collections/${COLLECTION}/documents/${TARGET_ID}" \
| jq -r '.title // ""'
}
wait_for_doc_title() {
local nodes_csv=$1 expected=$2 max_wait=${3:-90}
local waited=0
while [ "${waited}" -lt "${max_wait}" ]; do
local ok=0 total=0
IFS=',' read -ra nodes <<< "${nodes_csv}"
for node in "${nodes[@]}"; do
total=$((total + 1))
title=$(doc_title "${node}" 2>/dev/null || true)
[ "${title}" = "${expected}" ] && ok=$((ok + 1))
done
[ "${ok}" -eq "${total}" ] && return 0
sleep 1
waited=$((waited + 1))
done
echo "ERROR: expected ${TARGET_ID} title ${expected} on nodes ${nodes_csv}"
for node in 1 2 3; do
echo " node ${node}: $(doc_title "${node}" 2>/dev/null || echo unavailable)"
done
exit 2
}
start_node() {
local node=$1
local dir="${DATA_ROOT}/data-n${node}"
mkdir -p "${dir}"
printf '%s' "${NODES_CONF}" > "${dir}/nodes"
docker rm -f "${NAME[$node]}" >/dev/null 2>&1 || true
${DOCKER_NOCONV} docker run -d \
--name "${NAME[$node]}" \
--network "${NET}" \
--ip "${IP[$node]}" \
-p "${HOSTPORT[$node]}:8108" \
-v "${dir}:/data" \
"typesense/typesense:${VERSION}" \
--data-dir /data \
--api-key="${TYPESENSE_API_KEY}" \
--api-port 8108 \
--peering-port 8107 \
--peering-address "${IP[$node]}" \
--nodes /data/nodes \
--reset-peers-on-error=false >/dev/null
}
ensure_rocksdb_tools_image() {
if docker image inspect "${ROCKSDB_TOOLS_IMAGE}" >/dev/null 2>&1; then
return 0
fi
echo "=== Building local RocksDB tools image ==="
docker build -t "${ROCKSDB_TOOLS_IMAGE}" - <<'DOCKERFILE' >/dev/null
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y rocksdb-tools && rm -rf /var/lib/apt/lists/*
DOCKERFILE
}
hex_ascii() {
printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}
ldb_node3_db() {
local db_path=$1
shift
${DOCKER_NOCONV} docker run --rm \
-v "${DATA_ROOT}/data-n3:/data" \
"${ROCKSDB_TOOLS_IMAGE}" \
ldb --db="${db_path}" --try_load_options=false --ignore_unknown_options "$@"
}
latest_node3_snapshot_db() {
${DOCKER_NOCONV} docker run --rm \
-v "${DATA_ROOT}/data-n3:/data" \
"${ROCKSDB_TOOLS_IMAGE}" \
sh -lc "find /data/state/snapshot -type d -name db_snapshot 2>/dev/null | sort | tail -1"
}
delete_target_from_snapshot() {
local db_path=$1
local doc_key seq_prefix_hex seq_hex doc_hex seq_key_hex before_doc target_seq_id
doc_key="${COLLECTION_ID}_\$DI_${TARGET_ID}"
doc_hex=$(hex_ascii "${doc_key}")
before_doc=$(ldb_node3_db "${db_path}" get "${doc_key}" 2>/dev/null || true)
target_seq_id=$(printf '%s' "${before_doc}" | tr -d '[:space:]')
if ! printf '%s' "${target_seq_id}" | grep -Eq '^[0-9]+$'; then
echo "ERROR: expected snapshot doc_id key for ${TARGET_ID} to contain numeric seq_id"
echo "Computed doc key: ${doc_key}"
echo "Computed doc key hex: 0x${doc_hex}"
echo "Actual ldb output: ${before_doc}"
ldb_node3_db "${db_path}" scan --no_value --max_keys=80 || true
exit 2
fi
if [ "${target_seq_id}" != "${EXPECTED_TARGET_SEQ_ID}" ]; then
echo "WARNING: ${TARGET_ID} mapped to seq_id ${target_seq_id}, expected ${EXPECTED_TARGET_SEQ_ID}; using actual mapping."
fi
seq_prefix_hex=$(hex_ascii "${COLLECTION_ID}_\$SI_")
seq_hex=$(printf '%08x' "${target_seq_id}")
seq_key_hex="${seq_prefix_hex}${seq_hex}"
ldb_node3_db "${db_path}" delete "${doc_key}" >/dev/null
ldb_node3_db "${db_path}" --key_hex delete "0x${seq_key_hex}" >/dev/null
ldb_node3_db "${db_path}" compact >/dev/null || true
${DOCKER_NOCONV} docker run --rm -v "${DATA_ROOT}/data-n3:/data" "${ROCKSDB_TOOLS_IMAGE}" \
sh -lc "rm -f '${db_path}/LOCK'"
if ldb_node3_db "${db_path}" get "${doc_key}" >/dev/null 2>&1; then
echo "ERROR: snapshot still has ${doc_key} after deletion"
exit 2
fi
echo "Deleted ${TARGET_ID} from node 3 snapshot before catchup:"
echo " snapshot db ${db_path}"
echo " doc_id key ${doc_key}"
echo " seq_id ${target_seq_id}"
}
fetch_ids() {
local node=$1 sort_by=$2 per_page=${3:-12} page=${4:-1}
curl -sS -G "$(host_for_node "${node}")/collections/${COLLECTION}/documents/search" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
--data-urlencode "q=*" \
--data-urlencode "query_by=title" \
--data-urlencode "sort_by=${sort_by}" \
--data-urlencode "per_page=${per_page}" \
--data-urlencode "page=${page}" \
--data-urlencode "include_fields=id,metric,stable_rank,title" \
| jq -r '.hits[].document.id' | tr -d '\r' | paste -sd ',' -
}
fetch_doc() {
local node=$1
api "${node}" GET "/collections/${COLLECTION}/documents/${TARGET_ID}" \
| jq -c '{id, metric, stable_rank, title}'
}
assert_equal() {
local left=$1 right=$2 label=$3
if [ "${left}" != "${right}" ]; then
echo "ERROR: ${label}"
echo "LEFT : ${left}"
echo "RIGHT: ${right}"
exit 1
fi
}
# ============================================================================
# SETUP TYPESENSE
# ============================================================================
require_cmd docker
require_cmd curl
require_cmd jq
require_cmd od
require_cmd paste
cleanup >/dev/null 2>&1 || true
mkdir -p "${DATA_ROOT}" "${RESULT_DIR}"
echo "=== Pulling Typesense ${VERSION} ==="
docker pull "typesense/typesense:${VERSION}" >/dev/null
ensure_rocksdb_tools_image
echo "=== Starting 3 node local HA cluster ==="
docker network create --subnet="${SUBNET}" "${NET}" >/dev/null
for node in 1 2 3; do
start_node "${node}"
done
wait_for_health "1,2,3"
# ============================================================================
# CREATE COLLECTIONS
# ============================================================================
echo "=== Creating collection ==="
api 1 POST "/collections" \
-H "Content-Type: application/json" \
-d '{
"name":"screens",
"fields":[
{"name":"title","type":"string"},
{"name":"metric","type":"int32","sort":true},
{"name":"stable_rank","type":"int32","sort":true}
]
}' >/dev/null
wait_for_collection "${COLLECTION}"
# ============================================================================
# IMPORT DOCUMENTS
# ============================================================================
echo "=== Importing tied documents ==="
rm -f "${DOCS_FILE}"
for n in $(seq 1 "${DOC_COUNT}"); do
raw=$(printf '%03d' "${n}")
printf '{"id":"doc-%s","title":"screen %s","metric":1,"stable_rank":%d}\n' "${raw}" "${raw}" "${n}" >> "${DOCS_FILE}"
done
curl -sS -f "$(host_for_node 1)/collections/${COLLECTION}/documents/import?action=create" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
-H "Content-Type: text/plain" \
--data-binary @"${DOCS_FILE}" >/dev/null
wait_for_count "1,2,3" "${DOC_COUNT}"
baseline_1=$(fetch_ids 1 "metric:desc")
baseline_2=$(fetch_ids 2 "metric:desc")
baseline_3=$(fetch_ids 3 "metric:desc")
assert_equal "${baseline_1}" "${baseline_2}" "baseline node 1 and 2 order differ"
assert_equal "${baseline_1}" "${baseline_3}" "baseline node 1 and 3 order differ"
echo "Baseline tied sort order identical on all nodes."
echo "=== Creating clean node 3 snapshot ==="
api 3 POST "/operations/db/compact" >/dev/null
sleep 2
api 3 POST "/operations/snapshot" >/dev/null
SNAPSHOT_DB=$(latest_node3_snapshot_db)
if [ -z "${SNAPSHOT_DB}" ]; then
echo "ERROR: node 3 snapshot db was not created"
exit 2
fi
echo "Node 3 snapshot DB: ${SNAPSHOT_DB}"
# ============================================================================
# REPRODUCE THE ORIGIN PATH
# ============================================================================
echo "=== Stop node 3 and make its snapshot miss ${TARGET_ID} ==="
docker stop "${NAME[3]}" >/dev/null
docker rm "${NAME[3]}" >/dev/null
delete_target_from_snapshot "${SNAPSHOT_DB}"
echo "=== Commit upsert while node 3 is offline ==="
UPSERT_RESULT=$(curl -sS -f "$(host_for_node 1)/collections/${COLLECTION}/documents/import?action=upsert" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
-H "Content-Type: text/plain" \
--data-binary @- <<EOF
{"id":"${TARGET_ID}","title":"${TARGET_TITLE}","metric":1,"stable_rank":10}
EOF
)
if ! printf '%s' "${UPSERT_RESULT}" | jq -e 'select(.success == true)' >/dev/null; then
echo "ERROR: offline upsert failed"
printf '%s\n' "${UPSERT_RESULT}"
exit 2
fi
wait_for_doc_title "1,2" "${TARGET_TITLE}"
echo "=== Restart node 3 so snapshot load plus Raft catchup applies the upsert ==="
start_node 3
wait_for_health "1,2,3"
wait_for_count "1,2,3" "${DOC_COUNT}"
wait_for_doc_title "1,2,3" "${TARGET_TITLE}"
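# grep without -q reads all input, so docker logs cannot die from SIGPIPE and trip pipefail.
if ! docker logs "${NAME[3]}" 2>&1 | grep "on_snapshot_load" >/dev/null; then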
echo "ERROR: node 3 did not log on_snapshot_load"
docker logs --tail 80 "${NAME[3]}" 2>&1 || true
exit 2
fi
# ============================================================================
# RESULTS OUTPUT
# ============================================================================
doc_1=$(fetch_doc 1)
doc_2=$(fetch_doc 2)
doc_3=$(fetch_doc 3)
assert_equal "${doc_1}" "${doc_2}" "node 1 and 2 target fields differ after catchup"
assert_equal "${doc_1}" "${doc_3}" "node 1 and 3 target fields differ after catchup"
tied_1=$(fetch_ids 1 "metric:desc")
tied_2=$(fetch_ids 2 "metric:desc")
tied_3=$(fetch_ids 3 "metric:desc")
stable_1=$(fetch_ids 1 "metric:desc,stable_rank:desc")
stable_2=$(fetch_ids 2 "metric:desc,stable_rank:desc")
stable_3=$(fetch_ids 3 "metric:desc,stable_rank:desc")
assert_equal "${tied_1}" "${tied_2}" "node 1 and 2 tied sort order differ"
assert_equal "${stable_1}" "${stable_2}" "node 1 and 2 deterministic sort order differ"
assert_equal "${stable_1}" "${stable_3}" "deterministic secondary sort did not realign nodes"
printf '%s\n' "${tied_1}" > "${RESULT_DIR}/node1-tied-order.txt"
printf '%s\n' "${tied_2}" > "${RESULT_DIR}/node2-tied-order.txt"
printf '%s\n' "${tied_3}" > "${RESULT_DIR}/node3-tied-order.txt"
printf '%s\n' "${stable_1}" > "${RESULT_DIR}/node1-stable-order.txt"
printf '%s\n' "${stable_3}" > "${RESULT_DIR}/node3-stable-order.txt"
echo ""
echo "=== Result ==="
echo "Node 3 loaded a snapshot, then replayed an upsert committed while it was offline."
echo "Target document fields after catchup:"
echo " node 1: ${doc_1}"
echo " node 2: ${doc_2}"
echo " node 3: ${doc_3}"
echo ""
echo "Tied sort metric:desc, first 12 ids:"
echo " node 1: ${tied_1}"
echo " node 2: ${tied_2}"
echo " node 3: ${tied_3}"
echo ""
echo "Deterministic sort metric:desc,stable_rank:desc, first 12 ids:"
echo " node 1: ${stable_1}"
echo " node 3: ${stable_3}"
echo ""
if [ "${tied_1}" = "${tied_3}" ]; then
echo "NOT REPRODUCED: tied sort order stayed identical."
exit 1
fi
if ! printf '%s' "${tied_3}" | grep -q "^${TARGET_ID},"; then
echo "NOT REPRODUCED: node 3 order changed, but ${TARGET_ID} did not move to the tied group front."
exit 1
fi
echo "BUG REPRODUCED: seq_id drift was created during snapshot load plus Raft catchup."
echo "Same user fields, same committed upsert, different hidden seq_id tiebreaker."
Reproducer B: visible tied sort consequence once drift exists
Save as reproduce_seq_id_tied_sort_drift.sh, then run:
bash reproduce_seq_id_tied_sort_drift.sh
Expected ending:
BUG REPRODUCED: node 3 returns a different tied sort order even though user fields match.
This demonstrates seq_id as the hidden tied-sort tiebreaker and why a deterministic secondary sort fixes pagination boundaries.
Reproducer B full script
#!/bin/bash
# Issue: Replica local seq_id drift changes tied sort order
# Typesense Version: 30.2
# Description:
# Reproduces the customer visible symptom from seq-id-repro locally. Typesense uses
# the internal seq_id as the final tiebreaker after user supplied sort fields.
# The seq_id is derived on each replica at apply time and is not carried in
# the Raft log entry. This script creates a 3 node local HA cluster, creates a
# node 3 Raft snapshot, removes one document from that snapshot's RocksDB
# state, restarts node 3 from the mutated snapshot, then sends one replicated
# upsert for that same id. Nodes 1 and 2 reuse the original seq_id; node 3
# takes the new document branch and allocates a fresh seq_id. All user fields
# match after the upsert, but direct node searches sorted only by the tied
# metric produce a different order on node 3.
set -euo pipefail
# ============================================================================
# CONFIGURATION
# ============================================================================
TYPESENSE_API_KEY=xyz
VERSION=${VERSION:-30.2}
COLLECTION=screens
CONTAINER_NAME=typesense-issue-seq-id-repro-seq-id-tiebreaker-drift
NET=typesense-seqid-net
SUBNET=10.244.0.0/24
NODES_CONF="10.244.0.11:8107:8108,10.244.0.12:8107:8108,10.244.0.13:8107:8108"
DOC_COUNT=80
TARGET_ID=doc-010
EXPECTED_TARGET_SEQ_ID=9
COLLECTION_ID=0
ROCKSDB_TOOLS_IMAGE=typesense-local-rocksdb-tools:bookworm
IP=("_" 10.244.0.11 10.244.0.12 10.244.0.13)
HOSTPORT=(0 8318 8328 8338)
NAME=("_" ts-seqid-n1 ts-seqid-n2 ts-seqid-n3)
WORKDIR=$(pwd)
DOCKER_NOCONV=""
case "$(uname -s)" in
MINGW*|MSYS*|CYGWIN*)
DOCKER_NOCONV="env MSYS_NO_PATHCONV=1"
WORKDIR=$(pwd -W)
;;
esac
DATA_ROOT="${WORKDIR}/typesense-data-${CONTAINER_NAME}"
DOCS_FILE="${WORKDIR}/docs.jsonl"
RESULT_DIR="${WORKDIR}/results"
# ============================================================================
# CLEANUP FUNCTION
# ============================================================================
cleanup() {
echo ""
echo "=== Cleanup ==="
for i in 1 2 3; do
docker rm -f "${NAME[$i]}" >/dev/null 2>&1 || true
done
docker network rm "${NET}" >/dev/null 2>&1 || true
rm -rf "${DATA_ROOT}" "${DOCS_FILE}" "${RESULT_DIR}"
echo "Cleanup complete"
}
trap cleanup EXIT
require_cmd() {
command -v "$1" >/dev/null 2>&1 || {
echo "ERROR: required command not found: $1"
exit 127
}
}
# ============================================================================
# HELPERS
# ============================================================================
host_for_node() {
local node=$1
printf 'http://localhost:%s' "${HOSTPORT[$node]}"
}
api() {
local node=$1 method=$2 path=$3
shift 3
curl -sS -X "${method}" "$(host_for_node "${node}")${path}" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" "$@"
}
wait_for_health() {
local nodes_csv=$1 max_wait=${2:-90}
local waited=0
while [ "${waited}" -lt "${max_wait}" ]; do
local ok=0 total=0
IFS=',' read -ra nodes <<< "${nodes_csv}"
for node in "${nodes[@]}"; do
total=$((total + 1))
code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 2 \
"$(host_for_node "${node}")/health" 2>/dev/null || true)
[ "${code}" = "200" ] && ok=$((ok + 1))
done
[ "${ok}" -eq "${total}" ] && return 0
sleep 1
waited=$((waited + 1))
done
echo "ERROR: nodes ${nodes_csv} did not become healthy"
for i in 1 2 3; do
echo "--- ${NAME[$i]} logs ---"
docker logs --tail 30 "${NAME[$i]}" 2>&1 || true
done
exit 2
}
wait_for_collection() {
local collection=$1 max_wait=${2:-60}
local waited=0
while [ "${waited}" -lt "${max_wait}" ]; do
local ready=0
for node in 1 2 3; do
code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 2 \
"$(host_for_node "${node}")/collections/${collection}" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" 2>/dev/null || true)
[ "${code}" = "200" ] && ready=$((ready + 1))
done
[ "${ready}" -eq 3 ] && return 0
sleep 1
waited=$((waited + 1))
done
echo "ERROR: collection ${collection} was not visible on all nodes"
exit 2
}
collection_count() {
local node=$1
api "${node}" GET "/collections/${COLLECTION}" | jq -r '.num_documents // -1'
}
wait_for_count() {
local nodes_csv=$1 expected=$2 max_wait=${3:-90}
local waited=0
while [ "${waited}" -lt "${max_wait}" ]; do
local ok=0 total=0
IFS=',' read -ra nodes <<< "${nodes_csv}"
for node in "${nodes[@]}"; do
total=$((total + 1))
count=$(collection_count "${node}" 2>/dev/null || echo -1)
[ "${count}" = "${expected}" ] && ok=$((ok + 1))
done
[ "${ok}" -eq "${total}" ] && return 0
sleep 1
waited=$((waited + 1))
done
echo "ERROR: expected count ${expected} on nodes ${nodes_csv}"
for node in 1 2 3; do
echo " node ${node}: $(collection_count "${node}" 2>/dev/null || echo unavailable)"
done
exit 2
}
start_node() {
local node=$1
local dir="${DATA_ROOT}/data-n${node}"
mkdir -p "${dir}"
printf '%s' "${NODES_CONF}" > "${dir}/nodes"
docker rm -f "${NAME[$node]}" >/dev/null 2>&1 || true
${DOCKER_NOCONV} docker run -d \
--name "${NAME[$node]}" \
--network "${NET}" \
--ip "${IP[$node]}" \
-p "${HOSTPORT[$node]}:8108" \
-v "${dir}:/data" \
"typesense/typesense:${VERSION}" \
--data-dir /data \
--api-key="${TYPESENSE_API_KEY}" \
--api-port 8108 \
--peering-port 8107 \
--peering-address "${IP[$node]}" \
--nodes /data/nodes \
--reset-peers-on-error=false >/dev/null
}
ensure_rocksdb_tools_image() {
if docker image inspect "${ROCKSDB_TOOLS_IMAGE}" >/dev/null 2>&1; then
return 0
fi
echo "=== Building local RocksDB tools image ==="
docker build -t "${ROCKSDB_TOOLS_IMAGE}" - <<'DOCKERFILE' >/dev/null
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y rocksdb-tools && rm -rf /var/lib/apt/lists/*
DOCKERFILE
}
hex_ascii() {
printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}
ldb_node3_db() {
local db_path=$1
shift
${DOCKER_NOCONV} docker run --rm \
-v "${DATA_ROOT}/data-n3:/data" \
"${ROCKSDB_TOOLS_IMAGE}" \
ldb --db="${db_path}" --try_load_options=false --ignore_unknown_options "$@"
}
ldb_node3() {
ldb_node3_db /data/db "$@"
}
latest_node3_snapshot_db() {
${DOCKER_NOCONV} docker run --rm \
-v "${DATA_ROOT}/data-n3:/data" \
"${ROCKSDB_TOOLS_IMAGE}" \
sh -lc "find /data/state/snapshot -type d -name db_snapshot 2>/dev/null | sort | tail -1"
}
delete_target_from_node3_store() {
local db_path=${1:-/data/db}
local doc_key seq_prefix_hex seq_hex doc_hex seq_key_hex before_doc target_seq_id
doc_key="${COLLECTION_ID}_\$DI_${TARGET_ID}"
doc_hex=$(hex_ascii "${doc_key}")
before_doc=$(ldb_node3_db "${db_path}" get "${doc_key}" 2>/dev/null || true)
target_seq_id=$(printf '%s' "${before_doc}" | tr -d '[:space:]')
if ! printf '%s' "${target_seq_id}" | grep -Eq '^[0-9]+$'; then
echo "ERROR: expected node 3 doc_id key for ${TARGET_ID} to contain numeric seq_id"
echo "Computed doc key: ${doc_key}"
echo "Computed doc key hex: 0x${doc_hex}"
echo "Actual ldb output: ${before_doc}"
echo "Node 3 data directory:"
${DOCKER_NOCONV} docker run --rm -v "${DATA_ROOT}/data-n3:/data" "${ROCKSDB_TOOLS_IMAGE}" \
sh -lc "find /data -maxdepth 2 -type f | sort | head -40" || true
echo "Node 3 key sample:"
ldb_node3_db "${db_path}" scan --no_value --max_keys=60 || true
exit 2
fi
if [ "${target_seq_id}" != "${EXPECTED_TARGET_SEQ_ID}" ]; then
echo "WARNING: ${TARGET_ID} mapped to seq_id ${target_seq_id}, expected ${EXPECTED_TARGET_SEQ_ID}; using actual mapping."
fi
seq_prefix_hex=$(hex_ascii "${COLLECTION_ID}_\$SI_")
seq_hex=$(printf '%08x' "${target_seq_id}")
seq_key_hex="${seq_prefix_hex}${seq_hex}"
ldb_node3_db "${db_path}" delete "${doc_key}" >/dev/null
ldb_node3_db "${db_path}" --key_hex delete "0x${seq_key_hex}" >/dev/null
ldb_node3_db "${db_path}" compact >/dev/null || true
${DOCKER_NOCONV} docker run --rm -v "${DATA_ROOT}/data-n3:/data" "${ROCKSDB_TOOLS_IMAGE}" \
sh -lc "rm -f '${db_path}/LOCK'"
echo "Removed node 3 local keys for ${TARGET_ID}:"
echo " db path ${db_path}"
echo " doc_id key ${doc_key}"
echo " seq_id ${target_seq_id}"
}
fetch_ids() {
local node=$1 sort_by=$2 per_page=${3:-12} page=${4:-1}
curl -sS -G "$(host_for_node "${node}")/collections/${COLLECTION}/documents/search" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
--data-urlencode "q=*" \
--data-urlencode "query_by=title" \
--data-urlencode "sort_by=${sort_by}" \
--data-urlencode "per_page=${per_page}" \
--data-urlencode "page=${page}" \
--data-urlencode "include_fields=id,metric,stable_rank,title" \
| jq -r '.hits[].document.id' | tr -d '\r' | paste -sd ',' -
}
fetch_doc() {
local node=$1
api "${node}" GET "/collections/${COLLECTION}/documents/${TARGET_ID}" \
| jq -c '{id, metric, stable_rank, title}'
}
assert_equal() {
local left=$1 right=$2 label=$3
if [ "${left}" != "${right}" ]; then
echo "ERROR: ${label}"
echo "LEFT : ${left}"
echo "RIGHT: ${right}"
exit 1
fi
}
# ============================================================================
# SETUP TYPESENSE
# ============================================================================
require_cmd docker
require_cmd curl
require_cmd jq
require_cmd od
require_cmd paste
cleanup >/dev/null 2>&1 || true
mkdir -p "${DATA_ROOT}" "${RESULT_DIR}"
echo "=== Pulling Typesense ${VERSION} ==="
docker pull "typesense/typesense:${VERSION}" >/dev/null
ensure_rocksdb_tools_image
echo "=== Starting 3 node local HA cluster ==="
docker network create --subnet="${SUBNET}" "${NET}" >/dev/null
for node in 1 2 3; do
start_node "${node}"
done
wait_for_health "1,2,3"
# ============================================================================
# CREATE COLLECTIONS
# ============================================================================
echo "=== Creating collection ==="
api 1 POST "/collections" \
-H "Content-Type: application/json" \
-d '{
"name":"screens",
"fields":[
{"name":"title","type":"string"},
{"name":"metric","type":"int32","sort":true},
{"name":"stable_rank","type":"int32","sort":true}
]
}' >/dev/null
wait_for_collection "${COLLECTION}"
# ============================================================================
# IMPORT DOCUMENTS
# ============================================================================
echo "=== Importing tied documents ==="
rm -f "${DOCS_FILE}"
for n in $(seq 1 "${DOC_COUNT}"); do
raw=$(printf '%03d' "${n}")
printf '{"id":"doc-%s","title":"screen %s","metric":1,"stable_rank":%d}\n' "${raw}" "${raw}" "${n}" >> "${DOCS_FILE}"
done
curl -sS -f "$(host_for_node 1)/collections/${COLLECTION}/documents/import?action=create" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
-H "Content-Type: text/plain" \
--data-binary @"${DOCS_FILE}" >/dev/null
wait_for_count "1,2,3" "${DOC_COUNT}"
baseline_1=$(fetch_ids 1 "metric:desc")
baseline_2=$(fetch_ids 2 "metric:desc")
baseline_3=$(fetch_ids 3 "metric:desc")
assert_equal "${baseline_1}" "${baseline_2}" "baseline node 1 and 2 order differ"
assert_equal "${baseline_1}" "${baseline_3}" "baseline node 1 and 3 order differ"
echo "Baseline tied sort order identical on all nodes."
echo "=== Compacting node 3 RocksDB before local mutation ==="
api 3 POST "/operations/db/compact" >/dev/null
sleep 2
echo "=== Creating node 3 local Raft snapshot ==="
api 3 POST "/operations/snapshot" >/dev/null
SNAPSHOT_DB=$(latest_node3_snapshot_db)
if [ -z "${SNAPSHOT_DB}" ]; then
echo "ERROR: node 3 snapshot db was not created"
exit 2
fi
echo "Node 3 snapshot DB: ${SNAPSHOT_DB}"
# ============================================================================
# REPRODUCE THE ISSUE
# ============================================================================
echo "=== Simulating node 3 snapshot miss for ${TARGET_ID} ==="
docker stop "${NAME[3]}" >/dev/null
docker rm "${NAME[3]}" >/dev/null
delete_target_from_node3_store "${SNAPSHOT_DB}"
echo "=== Restarting node 3 with target doc missing locally ==="
start_node 3
wait_for_health "1,2,3"
wait_for_count "1,2" "${DOC_COUNT}"
wait_for_count "3" "$((DOC_COUNT - 1))"
echo "=== Upserting ${TARGET_ID} through local Raft write path ==="
UPSERT_RESULT=$(curl -sS -f "$(host_for_node 1)/collections/${COLLECTION}/documents/import?action=upsert" \
-H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
-H "Content-Type: text/plain" \
--data-binary @- <<EOF
{"id":"${TARGET_ID}","title":"screen 010 updated","metric":1,"stable_rank":10}
EOF
)
if ! printf '%s' "${UPSERT_RESULT}" | jq -e 'select(.success == true)' >/dev/null; then
echo "ERROR: upsert failed"
printf '%s\n' "${UPSERT_RESULT}"
exit 2
fi
wait_for_count "1,2,3" "${DOC_COUNT}"
doc_1=$(fetch_doc 1)
doc_2=$(fetch_doc 2)
doc_3=$(fetch_doc 3)
assert_equal "${doc_1}" "${doc_2}" "node 1 and 2 target fields differ after upsert"
assert_equal "${doc_1}" "${doc_3}" "node 1 and 3 target fields differ after upsert"
tied_1=$(fetch_ids 1 "metric:desc")
tied_2=$(fetch_ids 2 "metric:desc")
tied_3=$(fetch_ids 3 "metric:desc")
stable_1=$(fetch_ids 1 "metric:desc,stable_rank:desc")
stable_2=$(fetch_ids 2 "metric:desc,stable_rank:desc")
stable_3=$(fetch_ids 3 "metric:desc,stable_rank:desc")
assert_equal "${tied_1}" "${tied_2}" "node 1 and 2 tied sort order differ"
assert_equal "${stable_1}" "${stable_2}" "node 1 and 2 deterministic sort order differ"
assert_equal "${stable_1}" "${stable_3}" "deterministic secondary sort did not realign nodes"
printf '%s\n' "${tied_1}" > "${RESULT_DIR}/node1-tied-order.txt"
printf '%s\n' "${tied_2}" > "${RESULT_DIR}/node2-tied-order.txt"
printf '%s\n' "${tied_3}" > "${RESULT_DIR}/node3-tied-order.txt"
printf '%s\n' "${stable_1}" > "${RESULT_DIR}/node1-stable-order.txt"
printf '%s\n' "${stable_3}" > "${RESULT_DIR}/node3-stable-order.txt"
# ============================================================================
# RESULTS OUTPUT
# ============================================================================
echo ""
echo "=== Result ==="
echo "Target document fields after upsert:"
echo " node 1: ${doc_1}"
echo " node 2: ${doc_2}"
echo " node 3: ${doc_3}"
echo ""
echo "Tied sort metric:desc, first 12 ids:"
echo " node 1: ${tied_1}"
echo " node 2: ${tied_2}"
echo " node 3: ${tied_3}"
echo ""
echo "Deterministic sort metric:desc,stable_rank:desc, first 12 ids:"
echo " node 1: ${stable_1}"
echo " node 3: ${stable_3}"
echo ""
if [ "${tied_1}" = "${tied_3}" ]; then
echo "NOT REPRODUCED: tied sort order stayed identical."
exit 1
fi
if ! printf '%s' "${tied_3}" | grep -q "^${TARGET_ID},"; then
echo "NOT REPRODUCED: node 3 order changed, but ${TARGET_ID} did not move to the tied group front."
exit 1
fi
echo "BUG REPRODUCED: node 3 returns a different tied sort order even though user fields match."
echo "This demonstrates seq_id as the hidden tied-sort tiebreaker and why a deterministic secondary sort fixes pagination boundaries."
Expected vs Actual
Expected behavior
A committed upsert should not leave replicas with divergent hidden tiebreaker state for the same document id. If two replicas have the same user-visible fields for the same documents, paginated searches that tie on all user-supplied sort fields should not return different page boundaries only because one replica assigned a different internal seq_id during catchup.
Actual behavior
The same committed upsert can be applied as an update on replicas that have the existing doc_id -> seq_id mapping, but as a create on a follower whose snapshot state is missing that id. The follower allocates a fresh local seq_id. After catchup, all user visible fields match on all nodes, but direct node searches using only the tied sort field return a different order on the rebuilt follower.
Representative output from Reproducer A:
Target document fields after catchup:
node 1: {"id":"doc-010","metric":1,"stable_rank":10,"title":"screen 010 catchup"}
node 2: {"id":"doc-010","metric":1,"stable_rank":10,"title":"screen 010 catchup"}
node 3: {"id":"doc-010","metric":1,"stable_rank":10,"title":"screen 010 catchup"}
Tied sort metric:desc, first 12 ids:
node 1: doc-080,doc-079,doc-078,doc-077,doc-076,doc-075,doc-074,doc-073,doc-072,doc-071,doc-070,doc-069
node 2: doc-080,doc-079,doc-078,doc-077,doc-076,doc-075,doc-074,doc-073,doc-072,doc-071,doc-070,doc-069
node 3: doc-010,doc-080,doc-079,doc-078,doc-077,doc-076,doc-075,doc-074,doc-073,doc-072,doc-071,doc-070
Deterministic sort metric:desc,stable_rank:desc, first 12 ids:
node 1: doc-080,doc-079,doc-078,doc-077,doc-076,doc-075,doc-074,doc-073,doc-072,doc-071,doc-070,doc-069
node 3: doc-080,doc-079,doc-078,doc-077,doc-076,doc-075,doc-074,doc-073,doc-072,doc-071,doc-070,doc-069
Environment
- Typesense version: 30.2 Docker image by default. The scripts support overriding VERSION=<tag>.
- Operating system: Reproduced with local Docker Desktop on Windows using Git Bash. The scripts are bash and should also run on Linux with Docker.
- Client library & version: none. Reproducers use curl only.
- Other local dependencies: Docker, curl, jq, od, paste. The scripts build a local Debian based rocksdb-tools helper image if missing.
Schema / Configuration
The reproducers create a synthetic collection:
{
"name": "screens",
"fields": [
{ "name": "title", "type": "string" },
{ "name": "metric", "type": "int32", "sort": true },
{ "name": "stable_rank", "type": "int32", "sort": true }
]
}
Every document has metric=1, so sort_by=metric:desc produces a large tied group. stable_rank is unique and is used to show that a deterministic secondary sort realigns all nodes.
The HA cluster is local only:
10.245.0.11:8107:8108,10.245.0.12:8107:8108,10.245.0.13:8107:8108
The API key is the placeholder value xyz.
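For reference, the two searches being compared can be sketched with curl and jq as below. This is a minimal sketch, not an excerpt from the reproducers: NODE_URL, search_url, and first_ids are illustrative names, and the default address assumes the host can reach the Docker network directly on API port 8108, which may not hold on Docker Desktop.

```bash
#!/bin/bash
# Sketch only. NODE_URL is an assumption; substitute an address your host can reach.
NODE_URL="${NODE_URL:-http://10.245.0.11:8108}"
API_KEY="${API_KEY:-xyz}"

# search_url <sort_by>: URL for a wildcard search on the synthetic collection.
search_url() {
  printf '%s/collections/screens/documents/search?q=*&query_by=title&sort_by=%s&per_page=12&include_fields=id' \
    "${NODE_URL}" "$1"
}

# first_ids <sort_by>: comma-separated ids of the first 12 hits.
first_ids() {
  curl -sS -H "X-TYPESENSE-API-KEY: ${API_KEY}" "$(search_url "$1")" \
    | jq -r '[.hits[].document.id] | join(",")'
}

# Usage against a running node (not executed here):
#   first_ids 'metric:desc'                    # tied: order falls back to seq_id
#   first_ids 'metric:desc,stable_rank:desc'   # deterministic secondary sort
```

Running first_ids with the same sort_by against each node in turn gives the per-node strings that the reproducers compare.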
Additional Context
Relevant source behavior:
- src/raft_server.cpp: writes serialize request->to_json() into the Raft task. The log entry carries the request body, not a leader assigned seq_id.
- src/raft_server.cpp: on_snapshot_load() reloads the local RocksDB state from a snapshot and then initializes in memory state.
- src/collection.cpp: Collection::to_doc() looks up the external document id in the local store. If found, it reuses the existing seq_id. If missing, it calls get_next_seq_id() and treats the upsert as a new document.
- include/topster.h: KV::is_greater() compares sort scores and then key, where key is the internal seq_id.
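Taken together, these behaviors are enough to produce the drift. The toy model below is not Typesense code; it is a minimal sketch that assumes only the reuse-or-allocate rule and the seq_id tiebreak described above. The replica names a and b, the five document ids, and the helpers upsert and tied_order are all hypothetical, and "next seq_id" is modeled as the highest local seq_id plus one.

```bash
#!/bin/bash
# Toy model. State is a list of "<replica> <doc_id> <seq_id>" lines.
state=""

# upsert <replica> <doc_id>: reuse the replica's seq_id if the id is known
# locally, otherwise allocate the replica's next local seq_id.
upsert() {
  local next
  if printf '%s' "$state" | grep -q "^$1 $2 "; then
    return 0
  fi
  next=$(printf '%s' "$state" \
    | awk -v r="$1" '$1 == r && $3 >= n { n = $3 + 1 } END { print n + 0 }')
  state="${state}$1 $2 ${next}
"
}

# tied_order <replica>: ids when every sort score ties, highest seq_id first.
tied_order() {
  printf '%s' "$state" | awk -v r="$1" '$1 == r { print $3, $2 }' \
    | sort -rn | awk '{ print $2 }' | paste -sd, -
}

# Replica a indexes five documents and gets seq_ids 0..4.
for i in 1 2 3 4 5; do upsert a "doc-00$i"; done

# Replica b starts from a copy of a's state that is missing doc-002.
state="${state}$(printf '%s' "$state" \
  | awk '$1 == "a" && $2 != "doc-002" { print "b", $2, $3 }')
"

# The same committed upsert is applied on both replicas.
upsert a doc-002   # id known locally: seq_id 1 is reused
upsert b doc-002   # id missing locally: fresh seq_id 5 is allocated

echo "replica a: $(tied_order a)"   # doc-005,doc-004,doc-003,doc-002,doc-001
echo "replica b: $(tied_order b)"   # doc-002,doc-005,doc-004,doc-003,doc-001
```

In this model replica b moves doc-002 to the front of the tied group, which is the same shape as the doc-010 result on node 3 in the representative output.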
Possible fixes to consider:
- Include the leader assigned seq_id or an equivalent deterministic internal document identity in the replicated write operation, so all replicas apply the same id for the same document.
- Tighten snapshot load plus catchup ordering so a follower cannot apply a document upsert against a state where the corresponding id mapping from the installed snapshot is missing.
- Avoid exposing replica local seq_id as the final client visible tiebreaker for sorted pagination, or document that users must provide deterministic secondary sort fields whenever sort values can tie.
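Until one of these lands, the last option can be approximated on the client side. The helper below is a minimal sketch of that workaround; with_tiebreak is a hypothetical name, and it assumes the caller knows a field with unique, sortable values (stable_rank in the synthetic schema) and that adding it stays within the number of sort fields Typesense accepts.

```bash
#!/bin/bash
# with_tiebreak <sort_by> <unique_field>
# Appends "<unique_field>:desc" unless sort_by already sorts on that field.
with_tiebreak() {
  case ",$1," in
    *",$2:"*) printf '%s\n' "$1" ;;
    *)        printf '%s,%s:desc\n' "$1" "$2" ;;
  esac
}

with_tiebreak 'metric:desc' stable_rank                  # metric:desc,stable_rank:desc
with_tiebreak 'metric:desc,stable_rank:asc' stable_rank  # metric:desc,stable_rank:asc
```

The resulting string can be passed as sort_by so that page boundaries no longer depend on each replica's local seq_id.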