
Replica local seq_id can diverge during snapshot catchup and affect tied sort pagination #2948

@alangmartini

Bug Description

When all user-supplied sort fields tie, Typesense falls back to the internal seq_id as the final tiebreaker. Each replica assigns that seq_id locally when it applies a write: the serialized Raft log entry carries the document payload, but not a leader-assigned seq_id.
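
To make the fallback concrete, here is a toy model in plain shell, not Typesense code. It assumes the final tiebreaker puts the higher seq_id first, which is what the reproducers' check that the re-created document moves to the front of the tied group suggests. The row layout and the `tied_order` helper are invented for illustration.

```shell
#!/bin/bash
# Toy model only: each row is "id metric seq_id". Nothing here calls Typesense.

# Order rows by metric descending, then by seq_id descending (assumed
# tiebreaker direction), and print the ids as one comma-separated line.
tied_order() {
  sort -k2,2nr -k3,3nr | awk '{print $1}' | paste -sd ',' -
}

# Same user fields on both replicas; only the hidden seq_id of doc-010 differs.
node1_rows=$(printf '%s\n' 'doc-009 1 8' 'doc-010 1 9' 'doc-011 1 10')
node3_rows=$(printf '%s\n' 'doc-009 1 8' 'doc-010 1 80' 'doc-011 1 10')

echo "node 1: $(printf '%s\n' "${node1_rows}" | tied_order)"
echo "node 3: $(printf '%s\n' "${node3_rows}" | tied_order)"
```

Every `metric` is 1, so the order is decided entirely by the third column, and the two replicas disagree about where doc-010 belongs.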

As a result, a follower can end up with a different seq_id for the same external document id. This happens when the follower applies an upsert against a local store in which that id is missing, while the other replicas apply the same upsert as an update and reuse the existing seq_id.
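
The create-versus-update split can be sketched with a toy store, assuming only what is described above: a local id to seq_id mapping plus a local counter. The directory layout and the `new_store` and `apply_upsert` helpers are hypothetical and do not mirror Typesense internals.

```shell
#!/bin/bash
# Toy model only: a replica's store is a directory holding one file per
# document id (containing its seq_id) plus a "next" counter file.

new_store() {
  local dir
  dir=$(mktemp -d)
  echo 0 > "${dir}/next"
  echo "${dir}"
}

# Reuse the seq_id when the id is known locally (update branch); otherwise
# allocate a fresh one from the local counter (create branch).
apply_upsert() {
  local store=$1 id=$2 next
  if [ ! -f "${store}/${id}.seq" ]; then
    next=$(cat "${store}/next")
    echo "${next}" > "${store}/${id}.seq"
    echo $((next + 1)) > "${store}/next"
  fi
  cat "${store}/${id}.seq"
}

node1=$(new_store)
node3=$(new_store)
for id in doc-001 doc-002 doc-003; do
  apply_upsert "${node1}" "${id}" >/dev/null
  apply_upsert "${node3}" "${id}" >/dev/null
done

# Node 3 restores a snapshot that is missing doc-002 (counter left untouched).
rm -f "${node3}/doc-002.seq"

# The same committed upsert is now applied on both replicas.
node1_seq=$(apply_upsert "${node1}" doc-002)
node3_seq=$(apply_upsert "${node3}" doc-002)
echo "node 1 seq_id for doc-002: ${node1_seq}"
echo "node 3 seq_id for doc-002: ${node3_seq}"
rm -rf "${node1}" "${node3}"
```

Node 1 keeps the original value while node 3 hands out the next free one, even though both applied the identical log entry.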

The first script below demonstrates the origin path deterministically: node 3 loads a snapshot that is missing one document, then catches up an upsert for that same document from Raft. Nodes 1 and 2 reuse the old seq_id; node 3 allocates a fresh one during catchup. The second script demonstrates the visible consequence once the drift exists: same user fields on all nodes, but different tied sort order on node 3.

Both reproducers are fully local and synthetic. They use Docker, curl, jq, and a temporary local rocksdb-tools container. They do not connect to any remote cluster.
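
Both scripts edit node 3's snapshot with ldb, so they need the raw RocksDB keys for the target document. The sketch below shows the key layout the scripts assume: a `<collection_id>_$DI_<id>` key whose value is the seq_id, and a `<collection_id>_$SI_` prefix followed by the seq_id as eight hex digits. The layout is taken from the reproducers rather than from a documented Typesense API, and the values are the scripts' defaults.

```shell
#!/bin/bash
# Key layout as assumed by the reproducers; not a documented Typesense API.
COLLECTION_ID=0
TARGET_ID=doc-010
TARGET_SEQ_ID=9

# Hex-encode an ASCII string, e.g. "0_" becomes "305f".
hex_ascii() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# doc_id -> seq_id key, addressed as plain text.
doc_key="${COLLECTION_ID}_\$DI_${TARGET_ID}"
# seq_id -> document key, addressed in hex because its suffix is binary.
seq_key_hex="$(hex_ascii "${COLLECTION_ID}_\$SI_")$(printf '%08x' "${TARGET_SEQ_ID}")"

echo "doc_id key: ${doc_key}"
echo "seq_id key: 0x${seq_key_hex}"
```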

Reproduction Steps

Reproducer A: create seq_id drift during snapshot catchup

Save as reproduce_snapshot_catchup_seq_id_drift.sh, then run:

bash reproduce_snapshot_catchup_seq_id_drift.sh

Expected ending:

BUG REPRODUCED: seq_id drift was created during snapshot load plus Raft catchup.
Same user fields, same committed upsert, different hidden seq_id tiebreaker.
Reproducer A full script
#!/bin/bash
# Issue: seq_id drift is created during follower snapshot catchup
# Typesense Version: 30.2
# Description:
#   Reproduces the origin path for seq-id-repro locally. Node 3 is restarted from a
#   local Raft snapshot that is missing one document, while an upsert for that
#   same document is committed on nodes 1 and 2 while node 3 is offline. When
#   node 3 starts, it loads the snapshot and catches up the committed upsert from
#   Raft. Nodes 1 and 2 apply that upsert as an update and reuse the old seq_id.
#   Node 3 applies the same log entry against a store where the id is missing,
#   takes the create branch, allocates a fresh seq_id, and then returns a
#   different order for tied sorts.

set -euo pipefail

# ============================================================================
# CONFIGURATION
# ============================================================================

TYPESENSE_API_KEY=xyz
VERSION=${VERSION:-30.2}
COLLECTION=screens
CONTAINER_NAME=typesense-issue-seq-id-repro-seq-id-drift-during-snapshot-catchup
NET=typesense-seqid-origin-net
SUBNET=10.245.0.0/24
NODES_CONF="10.245.0.11:8107:8108,10.245.0.12:8107:8108,10.245.0.13:8107:8108"
DOC_COUNT=80
TARGET_ID=doc-010
TARGET_TITLE="screen 010 catchup"
EXPECTED_TARGET_SEQ_ID=9
COLLECTION_ID=0
ROCKSDB_TOOLS_IMAGE=typesense-local-rocksdb-tools:bookworm

IP=("_" 10.245.0.11 10.245.0.12 10.245.0.13)
HOSTPORT=(0 8418 8428 8438)
NAME=("_" ts-seqid-origin-n1 ts-seqid-origin-n2 ts-seqid-origin-n3)

WORKDIR=$(pwd)
DOCKER_NOCONV=""
case "$(uname -s)" in
  MINGW*|MSYS*|CYGWIN*)
    DOCKER_NOCONV="env MSYS_NO_PATHCONV=1"
    WORKDIR=$(pwd -W)
    ;;
esac

DATA_ROOT="${WORKDIR}/typesense-data-${CONTAINER_NAME}"
DOCS_FILE="${WORKDIR}/docs-origin.jsonl"
RESULT_DIR="${WORKDIR}/results-origin"

# ============================================================================
# CLEANUP FUNCTION
# ============================================================================

cleanup() {
  echo ""
  echo "=== Cleanup ==="
  for i in 1 2 3; do
    docker rm -f "${NAME[$i]}" >/dev/null 2>&1 || true
  done
  docker network rm "${NET}" >/dev/null 2>&1 || true
  rm -rf "${DATA_ROOT}" "${DOCS_FILE}" "${RESULT_DIR}"
  echo "Cleanup complete"
}
trap cleanup EXIT

require_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "ERROR: required command not found: $1"
    exit 127
  }
}

# ============================================================================
# HELPERS
# ============================================================================

host_for_node() {
  local node=$1
  printf 'http://localhost:%s' "${HOSTPORT[$node]}"
}

api() {
  local node=$1 method=$2 path=$3
  shift 3
  curl -sS -X "${method}" "$(host_for_node "${node}")${path}" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" "$@"
}

wait_for_health() {
  local nodes_csv=$1 max_wait=${2:-90}
  local waited=0
  while [ "${waited}" -lt "${max_wait}" ]; do
    local ok=0 total=0
    IFS=',' read -ra nodes <<< "${nodes_csv}"
    for node in "${nodes[@]}"; do
      total=$((total + 1))
      code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 2 \
        "$(host_for_node "${node}")/health" 2>/dev/null || true)
      [ "${code}" = "200" ] && ok=$((ok + 1))
    done
    [ "${ok}" -eq "${total}" ] && return 0
    sleep 1
    waited=$((waited + 1))
  done

  echo "ERROR: nodes ${nodes_csv} did not become healthy"
  for i in 1 2 3; do
    echo "--- ${NAME[$i]} logs ---"
    docker logs --tail 40 "${NAME[$i]}" 2>&1 || true
  done
  exit 2
}

wait_for_collection() {
  local collection=$1 max_wait=${2:-60}
  local waited=0
  while [ "${waited}" -lt "${max_wait}" ]; do
    local ready=0
    for node in 1 2 3; do
      code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 2 \
        "$(host_for_node "${node}")/collections/${collection}" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" 2>/dev/null || true)
      [ "${code}" = "200" ] && ready=$((ready + 1))
    done
    [ "${ready}" -eq 3 ] && return 0
    sleep 1
    waited=$((waited + 1))
  done
  echo "ERROR: collection ${collection} was not visible on all nodes"
  exit 2
}

collection_count() {
  local node=$1
  api "${node}" GET "/collections/${COLLECTION}" | jq -r '.num_documents // -1'
}

wait_for_count() {
  local nodes_csv=$1 expected=$2 max_wait=${3:-90}
  local waited=0
  while [ "${waited}" -lt "${max_wait}" ]; do
    local ok=0 total=0
    IFS=',' read -ra nodes <<< "${nodes_csv}"
    for node in "${nodes[@]}"; do
      total=$((total + 1))
      count=$(collection_count "${node}" 2>/dev/null || echo -1)
      [ "${count}" = "${expected}" ] && ok=$((ok + 1))
    done
    [ "${ok}" -eq "${total}" ] && return 0
    sleep 1
    waited=$((waited + 1))
  done
  echo "ERROR: expected count ${expected} on nodes ${nodes_csv}"
  for node in 1 2 3; do
    echo "  node ${node}: $(collection_count "${node}" 2>/dev/null || echo unavailable)"
  done
  exit 2
}

doc_title() {
  local node=$1
  api "${node}" GET "/collections/${COLLECTION}/documents/${TARGET_ID}" \
    | jq -r '.title // ""'
}

wait_for_doc_title() {
  local nodes_csv=$1 expected=$2 max_wait=${3:-90}
  local waited=0
  while [ "${waited}" -lt "${max_wait}" ]; do
    local ok=0 total=0
    IFS=',' read -ra nodes <<< "${nodes_csv}"
    for node in "${nodes[@]}"; do
      total=$((total + 1))
      title=$(doc_title "${node}" 2>/dev/null || true)
      [ "${title}" = "${expected}" ] && ok=$((ok + 1))
    done
    [ "${ok}" -eq "${total}" ] && return 0
    sleep 1
    waited=$((waited + 1))
  done
  echo "ERROR: expected ${TARGET_ID} title ${expected} on nodes ${nodes_csv}"
  for node in 1 2 3; do
    echo "  node ${node}: $(doc_title "${node}" 2>/dev/null || echo unavailable)"
  done
  exit 2
}

start_node() {
  local node=$1
  local dir="${DATA_ROOT}/data-n${node}"
  mkdir -p "${dir}"
  printf '%s' "${NODES_CONF}" > "${dir}/nodes"
  docker rm -f "${NAME[$node]}" >/dev/null 2>&1 || true
  ${DOCKER_NOCONV} docker run -d \
    --name "${NAME[$node]}" \
    --network "${NET}" \
    --ip "${IP[$node]}" \
    -p "${HOSTPORT[$node]}:8108" \
    -v "${dir}:/data" \
    "typesense/typesense:${VERSION}" \
    --data-dir /data \
    --api-key="${TYPESENSE_API_KEY}" \
    --api-port 8108 \
    --peering-port 8107 \
    --peering-address "${IP[$node]}" \
    --nodes /data/nodes \
    --reset-peers-on-error=false >/dev/null
}

ensure_rocksdb_tools_image() {
  if docker image inspect "${ROCKSDB_TOOLS_IMAGE}" >/dev/null 2>&1; then
    return 0
  fi

  echo "=== Building local RocksDB tools image ==="
  docker build -t "${ROCKSDB_TOOLS_IMAGE}" - <<'DOCKERFILE' >/dev/null
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y rocksdb-tools && rm -rf /var/lib/apt/lists/*
DOCKERFILE
}

hex_ascii() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

ldb_node3_db() {
  local db_path=$1
  shift
  ${DOCKER_NOCONV} docker run --rm \
    -v "${DATA_ROOT}/data-n3:/data" \
    "${ROCKSDB_TOOLS_IMAGE}" \
    ldb --db="${db_path}" --try_load_options=false --ignore_unknown_options "$@"
}

latest_node3_snapshot_db() {
  ${DOCKER_NOCONV} docker run --rm \
    -v "${DATA_ROOT}/data-n3:/data" \
    "${ROCKSDB_TOOLS_IMAGE}" \
    sh -lc "find /data/state/snapshot -type d -name db_snapshot 2>/dev/null | sort | tail -1"
}

delete_target_from_snapshot() {
  local db_path=$1
  local doc_key seq_prefix_hex seq_hex doc_hex seq_key_hex before_doc target_seq_id
  doc_key="${COLLECTION_ID}_\$DI_${TARGET_ID}"
  doc_hex=$(hex_ascii "${doc_key}")
  before_doc=$(ldb_node3_db "${db_path}" get "${doc_key}" 2>/dev/null || true)
  target_seq_id=$(printf '%s' "${before_doc}" | tr -d '[:space:]')

  if ! printf '%s' "${target_seq_id}" | grep -Eq '^[0-9]+$'; then
    echo "ERROR: expected snapshot doc_id key for ${TARGET_ID} to contain numeric seq_id"
    echo "Computed doc key: ${doc_key}"
    echo "Computed doc key hex: 0x${doc_hex}"
    echo "Actual ldb output: ${before_doc}"
    ldb_node3_db "${db_path}" scan --no_value --max_keys=80 || true
    exit 2
  fi

  if [ "${target_seq_id}" != "${EXPECTED_TARGET_SEQ_ID}" ]; then
    echo "WARNING: ${TARGET_ID} mapped to seq_id ${target_seq_id}, expected ${EXPECTED_TARGET_SEQ_ID}; using actual mapping."
  fi

  seq_prefix_hex=$(hex_ascii "${COLLECTION_ID}_\$SI_")
  seq_hex=$(printf '%08x' "${target_seq_id}")
  seq_key_hex="${seq_prefix_hex}${seq_hex}"

  ldb_node3_db "${db_path}" delete "${doc_key}" >/dev/null
  ldb_node3_db "${db_path}" --key_hex delete "0x${seq_key_hex}" >/dev/null
  ldb_node3_db "${db_path}" compact >/dev/null || true
  ${DOCKER_NOCONV} docker run --rm -v "${DATA_ROOT}/data-n3:/data" "${ROCKSDB_TOOLS_IMAGE}" \
    sh -lc "rm -f '${db_path}/LOCK'"

  if ldb_node3_db "${db_path}" get "${doc_key}" >/dev/null 2>&1; then
    echo "ERROR: snapshot still has ${doc_key} after deletion"
    exit 2
  fi

  echo "Deleted ${TARGET_ID} from node 3 snapshot before catchup:"
  echo "  snapshot db ${db_path}"
  echo "  doc_id key ${doc_key}"
  echo "  seq_id ${target_seq_id}"
}

fetch_ids() {
  local node=$1 sort_by=$2 per_page=${3:-12} page=${4:-1}
  curl -sS -G "$(host_for_node "${node}")/collections/${COLLECTION}/documents/search" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    --data-urlencode "q=*" \
    --data-urlencode "query_by=title" \
    --data-urlencode "sort_by=${sort_by}" \
    --data-urlencode "per_page=${per_page}" \
    --data-urlencode "page=${page}" \
    --data-urlencode "include_fields=id,metric,stable_rank,title" \
    | jq -r '.hits[].document.id' | tr -d '\r' | paste -sd ',' -
}

fetch_doc() {
  local node=$1
  api "${node}" GET "/collections/${COLLECTION}/documents/${TARGET_ID}" \
    | jq -c '{id, metric, stable_rank, title}'
}

assert_equal() {
  local left=$1 right=$2 label=$3
  if [ "${left}" != "${right}" ]; then
    echo "ERROR: ${label}"
    echo "LEFT : ${left}"
    echo "RIGHT: ${right}"
    exit 1
  fi
}

# ============================================================================
# SETUP TYPESENSE
# ============================================================================

require_cmd docker
require_cmd curl
require_cmd jq
require_cmd od
require_cmd paste

cleanup >/dev/null 2>&1 || true
mkdir -p "${DATA_ROOT}" "${RESULT_DIR}"

echo "=== Pulling Typesense ${VERSION} ==="
docker pull "typesense/typesense:${VERSION}" >/dev/null
ensure_rocksdb_tools_image

echo "=== Starting 3 node local HA cluster ==="
docker network create --subnet="${SUBNET}" "${NET}" >/dev/null
for node in 1 2 3; do
  start_node "${node}"
done
wait_for_health "1,2,3"

# ============================================================================
# CREATE COLLECTIONS
# ============================================================================

echo "=== Creating collection ==="
api 1 POST "/collections" \
  -H "Content-Type: application/json" \
  -d '{
    "name":"screens",
    "fields":[
      {"name":"title","type":"string"},
      {"name":"metric","type":"int32","sort":true},
      {"name":"stable_rank","type":"int32","sort":true}
    ]
  }' >/dev/null
wait_for_collection "${COLLECTION}"

# ============================================================================
# IMPORT DOCUMENTS
# ============================================================================

echo "=== Importing tied documents ==="
rm -f "${DOCS_FILE}"
for n in $(seq 1 "${DOC_COUNT}"); do
  raw=$(printf '%03d' "${n}")
  printf '{"id":"doc-%s","title":"screen %s","metric":1,"stable_rank":%d}\n' "${raw}" "${raw}" "${n}" >> "${DOCS_FILE}"
done

curl -sS -f "$(host_for_node 1)/collections/${COLLECTION}/documents/import?action=create" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: text/plain" \
  --data-binary @"${DOCS_FILE}" >/dev/null
wait_for_count "1,2,3" "${DOC_COUNT}"

baseline_1=$(fetch_ids 1 "metric:desc")
baseline_2=$(fetch_ids 2 "metric:desc")
baseline_3=$(fetch_ids 3 "metric:desc")
assert_equal "${baseline_1}" "${baseline_2}" "baseline node 1 and 2 order differ"
assert_equal "${baseline_1}" "${baseline_3}" "baseline node 1 and 3 order differ"
echo "Baseline tied sort order identical on all nodes."

echo "=== Creating clean node 3 snapshot ==="
api 3 POST "/operations/db/compact" >/dev/null
sleep 2
api 3 POST "/operations/snapshot" >/dev/null
SNAPSHOT_DB=$(latest_node3_snapshot_db)
if [ -z "${SNAPSHOT_DB}" ]; then
  echo "ERROR: node 3 snapshot db was not created"
  exit 2
fi
echo "Node 3 snapshot DB: ${SNAPSHOT_DB}"

# ============================================================================
# REPRODUCE THE ORIGIN PATH
# ============================================================================

echo "=== Stop node 3 and make its snapshot miss ${TARGET_ID} ==="
docker stop "${NAME[3]}" >/dev/null
docker rm "${NAME[3]}" >/dev/null
delete_target_from_snapshot "${SNAPSHOT_DB}"

echo "=== Commit upsert while node 3 is offline ==="
UPSERT_RESULT=$(curl -sS -f "$(host_for_node 1)/collections/${COLLECTION}/documents/import?action=upsert" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: text/plain" \
  --data-binary @- <<EOF
{"id":"${TARGET_ID}","title":"${TARGET_TITLE}","metric":1,"stable_rank":10}
EOF
)
if ! printf '%s' "${UPSERT_RESULT}" | jq -e 'select(.success == true)' >/dev/null; then
  echo "ERROR: offline upsert failed"
  printf '%s\n' "${UPSERT_RESULT}"
  exit 2
fi
wait_for_doc_title "1,2" "${TARGET_TITLE}"

echo "=== Restart node 3 so snapshot load plus Raft catchup applies the upsert ==="
start_node 3
wait_for_health "1,2,3"
wait_for_count "1,2,3" "${DOC_COUNT}"
wait_for_doc_title "1,2,3" "${TARGET_TITLE}"

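# Capture the logs first: under pipefail, `docker logs | grep -q` can fail
# spuriously when grep exits on its first match and closes the pipe.
node3_logs=$(docker logs "${NAME[3]}" 2>&1 || true)
if ! grep -q "on_snapshot_load" <<< "${node3_logs}"; then
  echo "ERROR: node 3 did not log on_snapshot_load"
  docker logs --tail 80 "${NAME[3]}" 2>&1 || true
  exit 2
fi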

# ============================================================================
# RESULTS OUTPUT
# ============================================================================

doc_1=$(fetch_doc 1)
doc_2=$(fetch_doc 2)
doc_3=$(fetch_doc 3)
assert_equal "${doc_1}" "${doc_2}" "node 1 and 2 target fields differ after catchup"
assert_equal "${doc_1}" "${doc_3}" "node 1 and 3 target fields differ after catchup"

tied_1=$(fetch_ids 1 "metric:desc")
tied_2=$(fetch_ids 2 "metric:desc")
tied_3=$(fetch_ids 3 "metric:desc")
stable_1=$(fetch_ids 1 "metric:desc,stable_rank:desc")
stable_2=$(fetch_ids 2 "metric:desc,stable_rank:desc")
stable_3=$(fetch_ids 3 "metric:desc,stable_rank:desc")

assert_equal "${tied_1}" "${tied_2}" "node 1 and 2 tied sort order differ"
assert_equal "${stable_1}" "${stable_2}" "node 1 and 2 deterministic sort order differ"
assert_equal "${stable_1}" "${stable_3}" "deterministic secondary sort did not realign nodes"

printf '%s\n' "${tied_1}" > "${RESULT_DIR}/node1-tied-order.txt"
printf '%s\n' "${tied_2}" > "${RESULT_DIR}/node2-tied-order.txt"
printf '%s\n' "${tied_3}" > "${RESULT_DIR}/node3-tied-order.txt"
printf '%s\n' "${stable_1}" > "${RESULT_DIR}/node1-stable-order.txt"
printf '%s\n' "${stable_3}" > "${RESULT_DIR}/node3-stable-order.txt"

echo ""
echo "=== Result ==="
echo "Node 3 loaded a snapshot, then replayed an upsert committed while it was offline."
echo "Target document fields after catchup:"
echo "  node 1: ${doc_1}"
echo "  node 2: ${doc_2}"
echo "  node 3: ${doc_3}"
echo ""
echo "Tied sort metric:desc, first 12 ids:"
echo "  node 1: ${tied_1}"
echo "  node 2: ${tied_2}"
echo "  node 3: ${tied_3}"
echo ""
echo "Deterministic sort metric:desc,stable_rank:desc, first 12 ids:"
echo "  node 1: ${stable_1}"
echo "  node 3: ${stable_3}"
echo ""

if [ "${tied_1}" = "${tied_3}" ]; then
  echo "NOT REPRODUCED: tied sort order stayed identical."
  exit 1
fi

if ! printf '%s' "${tied_3}" | grep -q "^${TARGET_ID},"; then
  echo "NOT REPRODUCED: node 3 order changed, but ${TARGET_ID} did not move to the tied group front."
  exit 1
fi

echo "BUG REPRODUCED: seq_id drift was created during snapshot load plus Raft catchup."
echo "Same user fields, same committed upsert, different hidden seq_id tiebreaker."

Reproducer B: visible tied sort consequence once drift exists

Save as reproduce_seq_id_tied_sort_drift.sh, then run:

bash reproduce_seq_id_tied_sort_drift.sh

Expected ending:

BUG REPRODUCED: node 3 returns a different tied sort order even though user fields match.
This demonstrates seq_id as the hidden tied-sort tiebreaker and why a deterministic secondary sort fixes pagination boundaries.
Reproducer B full script
#!/bin/bash
# Issue: Replica local seq_id drift changes tied sort order
# Typesense Version: 30.2
# Description:
#   Reproduces the customer visible symptom from seq-id-repro locally. Typesense uses
#   the internal seq_id as the final tiebreaker after user supplied sort fields.
#   The seq_id is derived on each replica at apply time and is not carried in
#   the Raft log entry. This script creates a 3 node local HA cluster, creates a
#   node 3 Raft snapshot, removes one document from that snapshot's RocksDB
#   state, restarts node 3 from the mutated snapshot, then sends one replicated
#   upsert for that same id. Nodes 1 and 2 reuse the original seq_id; node 3
#   takes the new document branch and allocates a fresh seq_id. All user fields
#   match after the upsert, but direct node searches sorted only by the tied
#   metric produce a different order on node 3.

set -euo pipefail

# ============================================================================
# CONFIGURATION
# ============================================================================

TYPESENSE_API_KEY=xyz
VERSION=${VERSION:-30.2}
COLLECTION=screens
CONTAINER_NAME=typesense-issue-seq-id-repro-seq-id-tiebreaker-drift
NET=typesense-seqid-net
SUBNET=10.244.0.0/24
NODES_CONF="10.244.0.11:8107:8108,10.244.0.12:8107:8108,10.244.0.13:8107:8108"
DOC_COUNT=80
TARGET_ID=doc-010
EXPECTED_TARGET_SEQ_ID=9
COLLECTION_ID=0
ROCKSDB_TOOLS_IMAGE=typesense-local-rocksdb-tools:bookworm

IP=("_" 10.244.0.11 10.244.0.12 10.244.0.13)
HOSTPORT=(0 8318 8328 8338)
NAME=("_" ts-seqid-n1 ts-seqid-n2 ts-seqid-n3)

WORKDIR=$(pwd)
DOCKER_NOCONV=""
case "$(uname -s)" in
  MINGW*|MSYS*|CYGWIN*)
    DOCKER_NOCONV="env MSYS_NO_PATHCONV=1"
    WORKDIR=$(pwd -W)
    ;;
esac

DATA_ROOT="${WORKDIR}/typesense-data-${CONTAINER_NAME}"
DOCS_FILE="${WORKDIR}/docs.jsonl"
RESULT_DIR="${WORKDIR}/results"

# ============================================================================
# CLEANUP FUNCTION
# ============================================================================

cleanup() {
  echo ""
  echo "=== Cleanup ==="
  for i in 1 2 3; do
    docker rm -f "${NAME[$i]}" >/dev/null 2>&1 || true
  done
  docker network rm "${NET}" >/dev/null 2>&1 || true
  rm -rf "${DATA_ROOT}" "${DOCS_FILE}" "${RESULT_DIR}"
  echo "Cleanup complete"
}
trap cleanup EXIT

require_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "ERROR: required command not found: $1"
    exit 127
  }
}

# ============================================================================
# HELPERS
# ============================================================================

host_for_node() {
  local node=$1
  printf 'http://localhost:%s' "${HOSTPORT[$node]}"
}

api() {
  local node=$1 method=$2 path=$3
  shift 3
  curl -sS -X "${method}" "$(host_for_node "${node}")${path}" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" "$@"
}

wait_for_health() {
  local nodes_csv=$1 max_wait=${2:-90}
  local waited=0
  while [ "${waited}" -lt "${max_wait}" ]; do
    local ok=0 total=0
    IFS=',' read -ra nodes <<< "${nodes_csv}"
    for node in "${nodes[@]}"; do
      total=$((total + 1))
      code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 2 \
        "$(host_for_node "${node}")/health" 2>/dev/null || true)
      [ "${code}" = "200" ] && ok=$((ok + 1))
    done
    [ "${ok}" -eq "${total}" ] && return 0
    sleep 1
    waited=$((waited + 1))
  done

  echo "ERROR: nodes ${nodes_csv} did not become healthy"
  for i in 1 2 3; do
    echo "--- ${NAME[$i]} logs ---"
    docker logs --tail 30 "${NAME[$i]}" 2>&1 || true
  done
  exit 2
}

wait_for_collection() {
  local collection=$1 max_wait=${2:-60}
  local waited=0
  while [ "${waited}" -lt "${max_wait}" ]; do
    local ready=0
    for node in 1 2 3; do
      code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 2 \
        "$(host_for_node "${node}")/collections/${collection}" \
        -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" 2>/dev/null || true)
      [ "${code}" = "200" ] && ready=$((ready + 1))
    done
    [ "${ready}" -eq 3 ] && return 0
    sleep 1
    waited=$((waited + 1))
  done
  echo "ERROR: collection ${collection} was not visible on all nodes"
  exit 2
}

collection_count() {
  local node=$1
  api "${node}" GET "/collections/${COLLECTION}" | jq -r '.num_documents // -1'
}

wait_for_count() {
  local nodes_csv=$1 expected=$2 max_wait=${3:-90}
  local waited=0
  while [ "${waited}" -lt "${max_wait}" ]; do
    local ok=0 total=0
    IFS=',' read -ra nodes <<< "${nodes_csv}"
    for node in "${nodes[@]}"; do
      total=$((total + 1))
      count=$(collection_count "${node}" 2>/dev/null || echo -1)
      [ "${count}" = "${expected}" ] && ok=$((ok + 1))
    done
    [ "${ok}" -eq "${total}" ] && return 0
    sleep 1
    waited=$((waited + 1))
  done
  echo "ERROR: expected count ${expected} on nodes ${nodes_csv}"
  for node in 1 2 3; do
    echo "  node ${node}: $(collection_count "${node}" 2>/dev/null || echo unavailable)"
  done
  exit 2
}

start_node() {
  local node=$1
  local dir="${DATA_ROOT}/data-n${node}"
  mkdir -p "${dir}"
  printf '%s' "${NODES_CONF}" > "${dir}/nodes"
  docker rm -f "${NAME[$node]}" >/dev/null 2>&1 || true
  ${DOCKER_NOCONV} docker run -d \
    --name "${NAME[$node]}" \
    --network "${NET}" \
    --ip "${IP[$node]}" \
    -p "${HOSTPORT[$node]}:8108" \
    -v "${dir}:/data" \
    "typesense/typesense:${VERSION}" \
    --data-dir /data \
    --api-key="${TYPESENSE_API_KEY}" \
    --api-port 8108 \
    --peering-port 8107 \
    --peering-address "${IP[$node]}" \
    --nodes /data/nodes \
    --reset-peers-on-error=false >/dev/null
}

ensure_rocksdb_tools_image() {
  if docker image inspect "${ROCKSDB_TOOLS_IMAGE}" >/dev/null 2>&1; then
    return 0
  fi

  echo "=== Building local RocksDB tools image ==="
  docker build -t "${ROCKSDB_TOOLS_IMAGE}" - <<'DOCKERFILE' >/dev/null
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y rocksdb-tools && rm -rf /var/lib/apt/lists/*
DOCKERFILE
}

hex_ascii() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

ldb_node3_db() {
  local db_path=$1
  shift
  ${DOCKER_NOCONV} docker run --rm \
    -v "${DATA_ROOT}/data-n3:/data" \
    "${ROCKSDB_TOOLS_IMAGE}" \
    ldb --db="${db_path}" --try_load_options=false --ignore_unknown_options "$@"
}

ldb_node3() {
  ldb_node3_db /data/db "$@"
}

latest_node3_snapshot_db() {
  ${DOCKER_NOCONV} docker run --rm \
    -v "${DATA_ROOT}/data-n3:/data" \
    "${ROCKSDB_TOOLS_IMAGE}" \
    sh -lc "find /data/state/snapshot -type d -name db_snapshot 2>/dev/null | sort | tail -1"
}

delete_target_from_node3_store() {
  local db_path=${1:-/data/db}
  local doc_key seq_prefix_hex seq_hex doc_hex seq_key_hex before_doc target_seq_id
  doc_key="${COLLECTION_ID}_\$DI_${TARGET_ID}"
  doc_hex=$(hex_ascii "${doc_key}")

  before_doc=$(ldb_node3_db "${db_path}" get "${doc_key}" 2>/dev/null || true)
  target_seq_id=$(printf '%s' "${before_doc}" | tr -d '[:space:]')

  if ! printf '%s' "${target_seq_id}" | grep -Eq '^[0-9]+$'; then
    echo "ERROR: expected node 3 doc_id key for ${TARGET_ID} to contain numeric seq_id"
    echo "Computed doc key: ${doc_key}"
    echo "Computed doc key hex: 0x${doc_hex}"
    echo "Actual ldb output: ${before_doc}"
    echo "Node 3 data directory:"
    ${DOCKER_NOCONV} docker run --rm -v "${DATA_ROOT}/data-n3:/data" "${ROCKSDB_TOOLS_IMAGE}" \
      sh -lc "find /data -maxdepth 2 -type f | sort | head -40" || true
    echo "Node 3 key sample:"
    ldb_node3_db "${db_path}" scan --no_value --max_keys=60 || true
    exit 2
  fi

  if [ "${target_seq_id}" != "${EXPECTED_TARGET_SEQ_ID}" ]; then
    echo "WARNING: ${TARGET_ID} mapped to seq_id ${target_seq_id}, expected ${EXPECTED_TARGET_SEQ_ID}; using actual mapping."
  fi

  seq_prefix_hex=$(hex_ascii "${COLLECTION_ID}_\$SI_")
  seq_hex=$(printf '%08x' "${target_seq_id}")
  seq_key_hex="${seq_prefix_hex}${seq_hex}"

  ldb_node3_db "${db_path}" delete "${doc_key}" >/dev/null
  ldb_node3_db "${db_path}" --key_hex delete "0x${seq_key_hex}" >/dev/null
  ldb_node3_db "${db_path}" compact >/dev/null || true
  ${DOCKER_NOCONV} docker run --rm -v "${DATA_ROOT}/data-n3:/data" "${ROCKSDB_TOOLS_IMAGE}" \
    sh -lc "rm -f '${db_path}/LOCK'"

  echo "Removed node 3 local keys for ${TARGET_ID}:"
  echo "  db path ${db_path}"
  echo "  doc_id key ${doc_key}"
  echo "  seq_id ${target_seq_id}"
}

fetch_ids() {
  local node=$1 sort_by=$2 per_page=${3:-12} page=${4:-1}
  curl -sS -G "$(host_for_node "${node}")/collections/${COLLECTION}/documents/search" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
    --data-urlencode "q=*" \
    --data-urlencode "query_by=title" \
    --data-urlencode "sort_by=${sort_by}" \
    --data-urlencode "per_page=${per_page}" \
    --data-urlencode "page=${page}" \
    --data-urlencode "include_fields=id,metric,stable_rank,title" \
    | jq -r '.hits[].document.id' | tr -d '\r' | paste -sd ',' -
}

fetch_doc() {
  local node=$1
  api "${node}" GET "/collections/${COLLECTION}/documents/${TARGET_ID}" \
    | jq -c '{id, metric, stable_rank, title}'
}

assert_equal() {
  local left=$1 right=$2 label=$3
  if [ "${left}" != "${right}" ]; then
    echo "ERROR: ${label}"
    echo "LEFT : ${left}"
    echo "RIGHT: ${right}"
    exit 1
  fi
}

# ============================================================================
# SETUP TYPESENSE
# ============================================================================

require_cmd docker
require_cmd curl
require_cmd jq
require_cmd od
require_cmd paste

cleanup >/dev/null 2>&1 || true
mkdir -p "${DATA_ROOT}" "${RESULT_DIR}"

echo "=== Pulling Typesense ${VERSION} ==="
docker pull "typesense/typesense:${VERSION}" >/dev/null
ensure_rocksdb_tools_image

echo "=== Starting 3 node local HA cluster ==="
docker network create --subnet="${SUBNET}" "${NET}" >/dev/null
for node in 1 2 3; do
  start_node "${node}"
done
wait_for_health "1,2,3"

# ============================================================================
# CREATE COLLECTIONS
# ============================================================================

echo "=== Creating collection ==="
api 1 POST "/collections" \
  -H "Content-Type: application/json" \
  -d '{
    "name":"screens",
    "fields":[
      {"name":"title","type":"string"},
      {"name":"metric","type":"int32","sort":true},
      {"name":"stable_rank","type":"int32","sort":true}
    ]
  }' >/dev/null
wait_for_collection "${COLLECTION}"

# ============================================================================
# IMPORT DOCUMENTS
# ============================================================================

echo "=== Importing tied documents ==="
rm -f "${DOCS_FILE}"
for n in $(seq 1 "${DOC_COUNT}"); do
  raw=$(printf '%03d' "${n}")
  printf '{"id":"doc-%s","title":"screen %s","metric":1,"stable_rank":%d}\n' "${raw}" "${raw}" "${n}" >> "${DOCS_FILE}"
done

curl -sS -f "$(host_for_node 1)/collections/${COLLECTION}/documents/import?action=create" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: text/plain" \
  --data-binary @"${DOCS_FILE}" >/dev/null
wait_for_count "1,2,3" "${DOC_COUNT}"

baseline_1=$(fetch_ids 1 "metric:desc")
baseline_2=$(fetch_ids 2 "metric:desc")
baseline_3=$(fetch_ids 3 "metric:desc")
assert_equal "${baseline_1}" "${baseline_2}" "baseline node 1 and 2 order differ"
assert_equal "${baseline_1}" "${baseline_3}" "baseline node 1 and 3 order differ"
echo "Baseline tied sort order identical on all nodes."

echo "=== Compacting node 3 RocksDB before local mutation ==="
api 3 POST "/operations/db/compact" >/dev/null
sleep 2

echo "=== Creating node 3 local Raft snapshot ==="
api 3 POST "/operations/snapshot" >/dev/null
SNAPSHOT_DB=$(latest_node3_snapshot_db)
if [ -z "${SNAPSHOT_DB}" ]; then
  echo "ERROR: node 3 snapshot db was not created"
  exit 2
fi
echo "Node 3 snapshot DB: ${SNAPSHOT_DB}"

# ============================================================================
# REPRODUCE THE ISSUE
# ============================================================================

echo "=== Simulating node 3 snapshot miss for ${TARGET_ID} ==="
docker stop "${NAME[3]}" >/dev/null
docker rm "${NAME[3]}" >/dev/null
delete_target_from_node3_store "${SNAPSHOT_DB}"

echo "=== Restarting node 3 with target doc missing locally ==="
start_node 3
wait_for_health "1,2,3"
wait_for_count "1,2" "${DOC_COUNT}"
wait_for_count "3" "$((DOC_COUNT - 1))"

echo "=== Upserting ${TARGET_ID} through local Raft write path ==="
UPSERT_RESULT=$(curl -sS -f "$(host_for_node 1)/collections/${COLLECTION}/documents/import?action=upsert" \
  -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY}" \
  -H "Content-Type: text/plain" \
  --data-binary @- <<EOF
{"id":"${TARGET_ID}","title":"screen 010 updated","metric":1,"stable_rank":10}
EOF
)
if ! printf '%s' "${UPSERT_RESULT}" | jq -e 'select(.success == true)' >/dev/null; then
  echo "ERROR: upsert failed"
  printf '%s\n' "${UPSERT_RESULT}"
  exit 2
fi
wait_for_count "1,2,3" "${DOC_COUNT}"

doc_1=$(fetch_doc 1)
doc_2=$(fetch_doc 2)
doc_3=$(fetch_doc 3)
assert_equal "${doc_1}" "${doc_2}" "node 1 and 2 target fields differ after upsert"
assert_equal "${doc_1}" "${doc_3}" "node 1 and 3 target fields differ after upsert"

tied_1=$(fetch_ids 1 "metric:desc")
tied_2=$(fetch_ids 2 "metric:desc")
tied_3=$(fetch_ids 3 "metric:desc")
stable_1=$(fetch_ids 1 "metric:desc,stable_rank:desc")
stable_2=$(fetch_ids 2 "metric:desc,stable_rank:desc")
stable_3=$(fetch_ids 3 "metric:desc,stable_rank:desc")

assert_equal "${tied_1}" "${tied_2}" "node 1 and 2 tied sort order differ"
assert_equal "${stable_1}" "${stable_2}" "node 1 and 2 deterministic sort order differ"
assert_equal "${stable_1}" "${stable_3}" "deterministic secondary sort did not realign nodes"

printf '%s\n' "${tied_1}" > "${RESULT_DIR}/node1-tied-order.txt"
printf '%s\n' "${tied_2}" > "${RESULT_DIR}/node2-tied-order.txt"
printf '%s\n' "${tied_3}" > "${RESULT_DIR}/node3-tied-order.txt"
printf '%s\n' "${stable_1}" > "${RESULT_DIR}/node1-stable-order.txt"
printf '%s\n' "${stable_3}" > "${RESULT_DIR}/node3-stable-order.txt"

# ============================================================================
# RESULTS OUTPUT
# ============================================================================

echo ""
echo "=== Result ==="
echo "Target document fields after upsert:"
echo "  node 1: ${doc_1}"
echo "  node 2: ${doc_2}"
echo "  node 3: ${doc_3}"
echo ""
echo "Tied sort metric:desc, first 12 ids:"
echo "  node 1: ${tied_1}"
echo "  node 2: ${tied_2}"
echo "  node 3: ${tied_3}"
echo ""
echo "Deterministic sort metric:desc,stable_rank:desc, first 12 ids:"
echo "  node 1: ${stable_1}"
echo "  node 3: ${stable_3}"
echo ""

if [ "${tied_1}" = "${tied_3}" ]; then
  echo "NOT REPRODUCED: tied sort order stayed identical."
  exit 1
fi

if ! printf '%s' "${tied_3}" | grep -q "^${TARGET_ID},"; then
  echo "NOT REPRODUCED: node 3 order changed, but ${TARGET_ID} did not move to the tied group front."
  exit 1
fi

echo "BUG REPRODUCED: node 3 returns a different tied sort order even though user fields match."
echo "This demonstrates seq_id as the hidden tied-sort tiebreaker and why a deterministic secondary sort fixes pagination boundaries."

Expected vs Actual

Expected behavior

A committed upsert should not produce replica divergent hidden tiebreaker state for the same document id. If two replicas have the same user visible fields for the same documents, paginated searches that tie on all user supplied sort fields should not return different page boundaries only because one replica assigned a different internal seq_id during catchup.

Actual behavior

The same committed upsert can be applied as an update on replicas that have the existing doc_id -> seq_id mapping, but as a create on a follower whose snapshot state is missing that id. The follower allocates a fresh local seq_id. After catchup, all user visible fields match on all nodes, but searches sent directly to each node using only the tied sort field return a different order on the rebuilt follower.
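
The update versus create split described above can be modeled in a few lines of bash. This is a toy sketch, not Typesense code: the maps, counters, starting values, and the apply_upsert helper are all hypothetical, and it assumes bash 4.3 or newer for associative arrays and namerefs.

```shell
# Toy model only. Each "replica" is a map from external doc id to local seq_id.
# The numbers are illustrative, not real Typesense seq_id values.
declare -A node1_map=( [doc-009]=9 [doc-010]=10 )
declare -A node3_map=( [doc-009]=9 )   # snapshot state is missing doc-010
node1_next=81
node3_next=81

# apply_upsert MAP_NAME COUNTER_NAME DOC_ID
# Reuses the existing seq_id when the id is known locally; otherwise allocates
# the next local seq_id. The result is stored in LAST_SEQ.
apply_upsert() {
  local -n map="$1"
  local -n next="$2"
  local id="$3"
  if [[ -z "${map[$id]+set}" ]]; then
    map[$id]=$next
    next=$((next + 1))
  fi
  LAST_SEQ="${map[$id]}"
}

# The same logical upsert is applied on both replicas.
apply_upsert node1_map node1_next doc-010; seq_node1="${LAST_SEQ}"
apply_upsert node3_map node3_next doc-010; seq_node3="${LAST_SEQ}"

echo "node 1 seq_id for doc-010: ${seq_node1}"
echo "node 3 seq_id for doc-010: ${seq_node3}"
```

In this model node 1 keeps the old value while node 3 hands out a new one, even though both applied an identical payload.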

Representative output from Reproducer A:

Target document fields after catchup:
  node 1: {"id":"doc-010","metric":1,"stable_rank":10,"title":"screen 010 catchup"}
  node 2: {"id":"doc-010","metric":1,"stable_rank":10,"title":"screen 010 catchup"}
  node 3: {"id":"doc-010","metric":1,"stable_rank":10,"title":"screen 010 catchup"}

Tied sort metric:desc, first 12 ids:
  node 1: doc-080,doc-079,doc-078,doc-077,doc-076,doc-075,doc-074,doc-073,doc-072,doc-071,doc-070,doc-069
  node 2: doc-080,doc-079,doc-078,doc-077,doc-076,doc-075,doc-074,doc-073,doc-072,doc-071,doc-070,doc-069
  node 3: doc-010,doc-080,doc-079,doc-078,doc-077,doc-076,doc-075,doc-074,doc-073,doc-072,doc-071,doc-070

Deterministic sort metric:desc,stable_rank:desc, first 12 ids:
  node 1: doc-080,doc-079,doc-078,doc-077,doc-076,doc-075,doc-074,doc-073,doc-072,doc-071,doc-070,doc-069
  node 3: doc-080,doc-079,doc-078,doc-077,doc-076,doc-075,doc-074,doc-073,doc-072,doc-071,doc-070,doc-069
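
To make the page boundary effect in the tied sort output above easier to see, here is a small sketch that lists which ids appear on one node's first page but not on the other's. only_in_first is a hypothetical helper that is not part of the reproducers; it assumes bash (for process substitution) plus tr, sort, comm, and paste.

```shell
# only_in_first LIST_A LIST_B
# Prints the comma-separated ids that occur in LIST_A but not in LIST_B.
# Both arguments are comma-separated id lists, as printed by the reproducers.
only_in_first() {
  LC_ALL=C comm -23 \
    <(tr ',' '\n' <<< "$1" | LC_ALL=C sort) \
    <(tr ',' '\n' <<< "$2" | LC_ALL=C sort) | paste -sd, -
}

# First pages copied from the representative output above.
node1_page="doc-080,doc-079,doc-078,doc-077,doc-076,doc-075,doc-074,doc-073,doc-072,doc-071,doc-070,doc-069"
node3_page="doc-010,doc-080,doc-079,doc-078,doc-077,doc-076,doc-075,doc-074,doc-073,doc-072,doc-071,doc-070"

echo "only on node 1 page 1: $(only_in_first "${node1_page}" "${node3_page}")"
echo "only on node 3 page 1: $(only_in_first "${node3_page}" "${node1_page}")"
```

For these two lists, doc-069 is on page 1 only on node 1 and doc-010 only on node 3, so a client whose page requests land on different nodes can see an id skipped or repeated at the boundary.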

Environment

  • Typesense version: 30.2 Docker image by default. The scripts support overriding VERSION=<tag>.
  • Operating system: Reproduced with local Docker Desktop on Windows using Git Bash. The scripts are bash and should also run on Linux with Docker.
  • Client library & version: none. Reproducers use curl only.
  • Other local dependencies: Docker, curl, jq, od, paste. The scripts build a local Debian based rocksdb-tools helper image if missing.
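
A quick preflight check for the local dependencies listed above could look like the following. missing_tools is a hypothetical helper, not something the reproducers are known to include.

```shell
# missing_tools TOOL...
# Prints the space-separated names of tools that are not found on PATH.
# Prints an empty line when everything is available.
missing_tools() {
  local tool missing=()
  for tool in "$@"; do
    command -v "${tool}" >/dev/null 2>&1 || missing+=("${tool}")
  done
  printf '%s\n' "${missing[*]}"
}

missing=$(missing_tools docker curl jq od paste)
if [ -n "${missing}" ]; then
  echo "Missing local dependencies: ${missing}"
fi
```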

Schema / Configuration

The reproducers create a synthetic collection:

{
  "name": "screens",
  "fields": [
    { "name": "title", "type": "string" },
    { "name": "metric", "type": "int32", "sort": true },
    { "name": "stable_rank", "type": "int32", "sort": true }
  ]
}

Every document has metric=1, so sort_by=metric:desc produces a large tied group. stable_rank is unique and is used to show that a deterministic secondary sort realigns all nodes.

The HA cluster is local only:

10.245.0.11:8107:8108,10.245.0.12:8107:8108,10.245.0.13:8107:8108

The API key is the placeholder value xyz.

Additional Context

Relevant source behavior:

  • src/raft_server.cpp: writes serialize request->to_json() into the Raft task. The log entry carries the request body, not a leader assigned seq_id.
  • src/raft_server.cpp: on_snapshot_load() reloads the local RocksDB state from a snapshot and then initializes in memory state.
  • src/collection.cpp: Collection::to_doc() looks up the external document id in the local store. If found, it reuses the existing seq_id. If missing, it calls get_next_seq_id() and treats the upsert as a new document.
  • include/topster.h: KV::is_greater() compares sort scores and then key, where key is the internal seq_id.
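
The comparison in the last bullet can be imitated with sort to show why a single changed seq_id reorders a tied group. This is a sketch under assumptions: rank_ids is a hypothetical helper, the seq_id values are made up, and the descending tiebreak direction is inferred from the representative output (where the freshly allocated seq_id sorts first), not taken from the source.

```shell
# rank_ids: reads "doc_id score seq_id" rows on stdin and prints the ids
# ordered by score descending, then seq_id descending (assumed direction).
rank_ids() {
  sort -k2,2nr -k3,3nr | awk '{print $1}' | paste -sd, -
}

# Same user visible score on both nodes; only doc-010's seq_id differs.
node1_rows=$'doc-010 1 10\ndoc-079 1 79\ndoc-080 1 80'
node3_rows=$'doc-010 1 81\ndoc-079 1 79\ndoc-080 1 80'

echo "node 1: $(printf '%s\n' "${node1_rows}" | rank_ids)"
echo "node 3: $(printf '%s\n' "${node3_rows}" | rank_ids)"
```

With every score tied, the order is decided entirely by the third column, so node 3 places doc-010 first while node 1 places it last.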

Possible fixes to consider:

  1. Include the leader assigned seq_id or an equivalent deterministic internal document identity in the replicated write operation, so all replicas apply the same id for the same document.
  2. Tighten snapshot load plus catchup ordering so a follower cannot apply a document upsert against a state where the corresponding id mapping from the installed snapshot is missing.
  3. Avoid exposing replica local seq_id as the final client visible tiebreaker for sorted pagination, or document that users must provide deterministic secondary sort fields whenever sort values can tie.
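
Until one of these lands, the client side half of option 3 is to always append a unique sort field. A minimal sketch, assuming the screens schema above: build_search_url and search_page are hypothetical helpers, and the localhost base URL is a placeholder for whichever node is being queried.

```shell
# build_search_url BASE_URL COLLECTION SORT_BY PAGE PER_PAGE
# Prints a search URL for a wildcard query with the given sort and paging.
build_search_url() {
  local base="$1" collection="$2" sort_by="$3" page="$4" per_page="$5"
  printf '%s/collections/%s/documents/search?q=*&query_by=title&sort_by=%s&page=%s&per_page=%s\n' \
    "${base}" "${collection}" "${sort_by}" "${page}" "${per_page}"
}

# search_page: same arguments as build_search_url; needs a reachable node.
# Falls back to the placeholder key xyz when TYPESENSE_API_KEY is unset.
search_page() {
  curl -sS -f "$(build_search_url "$@")" \
    -H "X-TYPESENSE-API-KEY: ${TYPESENSE_API_KEY:-xyz}"
}

# stable_rank is unique per document, so no tie is left for seq_id to break.
build_search_url "http://localhost:8108" screens "metric:desc,stable_rank:desc" 1 12
```

Calling search_page with the same arguments and piping the result through jq -r '.hits[].document.id' would list the ids for that page.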
