Anbeeld's BeeLlama.cpp

BeeLlama.cpp (or just Bee) is a performance-focused llama.cpp fork for squeezing more speed and context out of local GGUF inference. It keeps the familiar llama.cpp tools and server flow, then adds DFlash speculative decoding, adaptive draft control, TurboQuant/TCQ KV-cache compression, and reasoning-loop protection, with full multimodal support.

Not quite a pegasus, but close enough.

Plug-and-Play Setups

Fork Features

DFlash speculative decoding: --spec-type dflash drives a DFlash draft GGUF alongside the target model. The target captures hidden states into a ring buffer, the drafter cross-attends to the most recent --spec-dflash-cross-ctx hidden-state tokens and proposes drafts for target verification.
Adaptive draft-max control: The server adjusts the active draft horizon at runtime instead of using a fixed --spec-draft-n-max. The default profit controller compares speculative throughput against a no-spec baseline, the fringe alternative maps acceptance-rate bands to draft depth.
Full multimodal support: When --mmproj is active, the server keeps DFlash available for text generation. The model can be fully offloaded to CPU with no problems to reduce VRAM pressure.
TurboQuant / TCQ KV-cache compression: Five cache types (turbo2, turbo3, turbo4, turbo2_tcq, turbo3_tcq) spanning from 4x to 7.5x compression, with TCQ types offering good precision for their size. Set independently with --cache-type-k and --cache-type-v.
Reasoning-loop protection: The server detects repeated hidden reasoning output and intervenes. Default mode is force-close, with --reasoning-loop-window and --reasoning-loop-max-period tuning available.
Sampled DFlash verification: --spec-draft-temp enables rejection-sampling drafter behavior. Activates when both draft and target temperature exceed zero. Draft log probabilities must be available for rejection sampling to produce correct output.
DDTree branch verification: optional --spec-branch-budget adds branch nodes beyond the main draft path with GPU parent_ids, tree masks, and recurrent tree kernels. Disabled automatically when the target model spans more than one GPU. This one is very much work in progress!
Request-level speculative overrides: Draft-max and branch budget can be overridden per-request through JSON fields without restarting the server.
CopySpec model-free speculation: --spec-type copyspec provides rolling-hash suffix matching over previous tokens without a draft model.

For the full feature and public-repo comparison, read docs/beellama-features.md. For the complete argument reference, read docs/beellama-args.md.

TurboQuant (WHT-based scalar quantization) originates from TheTom/llama-cpp-turboquant. TCQ (Trellis-Coded Quantization) and basic DFlash implementation originate from spiritbuun/buun-llama-cpp (paper: Closing the Gap: Trellis-Coded Quantization for KV Cache at 2-3 Bits).

DFlash Speedup

DFlash is strongest on structured, repetitive generation: code, tests, boilerplate, JSON-like formats, and other low-entropy continuations. Open-ended prose is much less predictable, so gains are smaller.

Setup: Windows 11, AMD Ryzen 7 5700X3D, 32 GB DDR4 RAM, RTX 3090 24 GB
Config: same as in quick start docs (Qwen 3.6 27B, Gemma 4 31B), but with reasoning off for non-chat prompts
Baseline and MTP server in comparison: llama.cpp b9275 CUDA 13.1 Windows prebuilt

Benchmark prompts

Task store module

Write one complete Python 3 file using only the standard library.

Return only Python code. Do not use markdown, comments, tests, examples, or explanatory text.

Implement a deterministic Task store module with a compact, repetitive structure that is easy to predict.

Required shape:
- imports: dataclasses, datetime, typing
- dataclass Task with fields id: int, title: str, status: str, created_at: str
- class TaskStore with an internal dict[int, Task]
- methods: add, get, rename, mark_done, reopen, delete, clear, list_all, list_open, list_done, count_open, count_done, titles, to_dicts, __len__, __contains__
- add assigns increasing integer ids starting at 1
- valid statuses are "open" and "done"
- all list methods return tasks sorted by id
- count_open and count_done use explicit loops
- titles returns task titles sorted by task id
- to_dicts returns deterministic dictionaries sorted by id
- to_dicts includes id, title, status, and created_at keys for every task
- raise ValueError for empty title or missing task id
- use straightforward if statements and explicit loops
- keep method bodies short and similar in style
- no argparse, no JSON, no file IO, no unittest, no pytest
- target about 110 to 132 lines of code
- define __all__ = ["Task", "TaskStore"]
- stop immediately after defining __all__

KV report module

Write one complete Python 3 file using only the standard library.

Return only Python code. Do not use markdown, comments, tests, examples, or explanatory text.

Implement a deterministic KV report module with a compact, repetitive structure that is easy to predict.

Required shape:
- imports: dataclasses, typing
- dataclass Row with fields key: str, value: str
- class Report with an internal list[Row]
- methods: add, set, get, delete, clear, keys, values, items, sorted_rows, render_lines, render_text, render_csv, filter_prefix, update_many, to_dict, copy, count_prefix, first_key, __len__, __contains__
- add appends a new row and rejects duplicate keys
- set updates an existing row or appends a new row
- get returns the value for a key
- delete removes a row by key
- keys, values, and items preserve insertion order
- sorted_rows returns rows sorted by key
- render_lines returns strings formatted as "key: value"
- render_text joins render_lines with newline characters
- render_csv returns deterministic "key,value" lines with a header
- filter_prefix returns a new Report containing keys that start with the prefix
- update_many applies set for each key and value in a dictionary sorted by key
- to_dict returns a deterministic dictionary sorted by key
- copy returns a new Report with the same rows in the same order
- count_prefix returns the number of keys that start with the prefix using an explicit loop
- first_key returns the first key and raises ValueError when there are no rows
- raise ValueError for empty keys, duplicate keys, or missing keys
- use straightforward if statements and explicit loops
- keep method bodies short and similar in style
- no enum, no alignment modes, no markdown table, no textwrap, no itertools, no unittest, no pytest
- target about 130 to 155 lines of code
- define __all__ = ["Row", "Report"]
- stop immediately after defining __all__

Doubly-linked list

Write a complete Python 3 module implementing a doubly-linked list with the following methods: append, prepend, insert_at, remove_at, find, reverse, to_list, length, is_empty, iter. Include comprehensive docstrings, type hints, and pytest unit tests for every method. Return only the code, no commentary.

Multi-turn coding

Turn 1

Build an async WebSocket gateway for telemetry devices. Use Python 3.12 style typing.

Return exactly one labeled code block per requested file. Label each code block with a file header like `# ws_gateway/router.py`; outside code blocks, use at most one short sentence per file explaining why it exists. Assume this is a chat response only: do not mention saving files or accessing a filesystem.

Return code blocks for:
- `ws_gateway/models.py`: dataclasses/enums for `Topic`, `OutboundMessage`, `AckState`, `ClientId`, and `MetricSnapshot`
- `ws_gateway/metrics.py`: small standard-library metrics collector with counters/gauges and a Prometheus text export method
- `ws_gateway/errors.py`: error types for invalid topics, duplicate subscriptions, queue overflow policy errors, and closed sessions
- `ws_gateway/config.py`: immutable config dataclass for queue depth, ping interval, ack timeout, and rate limits

Keep prose brief and adjacent to the code. Do not write a separate architecture essay. Do not add extra files, alternate designs, or optional extensions.

Turn 2

Now implement the topic router and subscriber queues. Return code-first output only: labeled code blocks plus brief inline notes if needed.

Return code blocks for:
- `ws_gateway/router.py`
- `tests/test_router.py`

Requirements:
- MQTT-like topics with `+` for one segment and `#` for the remaining suffix
- `subscribe`, `unsubscribe`, `publish`, `drain`, and `close_subscriber`
- stable per-subscriber ordering
- bounded per-subscriber queues with drop-oldest backpressure using a per-subscriber `deque(maxlen=...)`
- injectable clock for tests
- metrics updates for publishes, deliveries, drops, and active subscriptions
- pytest tests for wildcard matching, ordering, unsubscribe cleanup, and queue overflow

Use only the standard library in runtime code. Stay within the names and structure already established unless this prompt explicitly changes them.

Turn 3

Add the connection/session layer over the router. Keep the answer mostly code: labeled code blocks and tests, no standalone design section.

Return code blocks for:
- `ws_gateway/session.py`
- `tests/test_session.py`

Implement:
- `ClientSession` with user id, connection id, subscribed topics, last pong time, pending ack ids, and lifecycle state
- `WebSocketLike` protocol with `send_json`, `close`, and `ping`
- `ConnectionManager` that accepts an authenticated session, starts/stops tasks, maps users to sessions, and unsubscribes on disconnect
- outbound delivery loop from subscriber queue to websocket
- ping/pong keepalive with zombie detection at 3x interval
- incoming token-bucket rate limiter at 10 messages/sec with burst 20
- tests for clean disconnect, zombie close, slow consumer drops, and rate limit rejection

Reuse the exact names and APIs already established unless this prompt explicitly changes them.

Turn 4

Wire the service boundary. Keep prose to one-line notes beside code blocks.

Return code blocks for:
- `ws_gateway/app.py`: aiohttp app factory, websocket handler skeleton, health/readiness handlers, JWT verifier protocol, and graceful shutdown hooks
- `ws_gateway/persistence.py`: minimal ack persistence protocol plus an in-memory implementation for tests
- `tests/test_app_lifecycle.py`: tests for shutdown order, readiness false when persistence is unhealthy, and ack drain on disconnect
- `docs/load_test_plan.md`: concise checklist only, no paragraphs

Keep handlers thin and concrete. Do not introduce extra modules or abstractions beyond the listed files. The shutdown order and dependency boundaries must be concrete in code. Reuse the exact names and APIs already established unless this prompt explicitly changes them.

Turn 5

Patch the code from this conversation. Output only chat-safe patch content: one short comment line per issue, then a unified diff or replacement snippet, then a pytest test. No separate review prose and no filesystem instructions.

Fix the top 5 concrete correctness problems likely in this gateway:
- leaked delivery or keepalive tasks on disconnect
- duplicate subscription state after reconnect
- unfair drop-oldest behavior across hot and cold topics
- ack persistence race during shutdown
- readiness or metrics reporting success while internal tasks are failing

For each fix, include:
1. a one-line bug label as a code comment
2. the smallest Python diff or replacement snippet
3. one focused pytest test that fails before the fix

Stay within the existing file labels and APIs.

Qwen 3.6 27B

Target model: Qwen 3.6 27B Q5_K_S or Qwen 3.6 27B MTP Q5_K_S. DFlash model: Q4_K_M.

Prompt	Server	Output	Median	Best	Speedup	Acceptance
Task store module	Baseline	~1K tok	37.2 tok/s	37.2 tok/s	1.00x	N/A
Task store module	DFlash	~1K tok	163.9 tok/s	181.9 tok/s	4.40x	67.7% / 89.2%
Task store module	MTP	~1K tok	69.3 tok/s	69.6 tok/s	1.86x	92.0% / 73.3%
KV report module	Baseline	~1K tok	34.6 tok/s	36.5 tok/s	1.00x	N/A
KV report module	DFlash	~1K tok	157.7 tok/s	162.5 tok/s	4.56x	58.8% / 88.9%
KV report module	MTP	~1K tok	67.3 tok/s	68.1 tok/s	1.94x	89.3% / 73.0%
Doubly-linked list	Baseline	~4K tok	36.8 tok/s	36.9 tok/s	1.00x	N/A
Doubly-linked list	DFlash	~4K tok	130.8 tok/s	154.1 tok/s	3.56x	50.4% / 86.8%
Doubly-linked list	MTP	~4K tok	66.3 tok/s	68.0 tok/s	1.80x	87.8% / 72.5%
Prompt processing	Baseline	~20K tok	1229.5 tok/s	1229.5 tok/s	1.00x	N/A
Prompt processing	DFlash	~20K tok	1214.4 tok/s	1221.7 tok/s	0.99x	N/A
Prompt processing	MTP	~20K tok	1162.6 tok/s	1164.7 tok/s	0.95x	N/A
Multi-turn coding	Baseline	~28K tok	33.3 tok/s	33.3 tok/s	1.00x	N/A
Multi-turn coding	DFlash	~30K tok	64.6 tok/s	65.4 tok/s	1.94x	24.9% / 72.9%
Multi-turn coding	MTP	~34K tok	56.5 tok/s	56.5 tok/s	1.70x	71.9% / 68.3%

Acceptance: accepted to proposed draft tokens / accepted draft tokens to final generated tokens

Gemma 4 31B

Target model: Gemma 4 31B Q4_K_S. DFlash model: Q5_K_M.

Prompt	Server	Output	Median	Best	Speedup	Acceptance
Task store module	Baseline	~1K tok	36.1 tok/s	36.1 tok/s	1.00x	N/A
Task store module	DFlash	~1K tok	177.8 tok/s	182.0 tok/s	4.93x	65.7% / 90.0%
KV report module	Baseline	~1K tok	35.9 tok/s	36.0 tok/s	1.00x	N/A
KV report module	DFlash	~1K tok	154.3 tok/s	162.8 tok/s	4.29x	55.7% / 88.6%
Doubly-linked list	Baseline	~1.9K tok	36.0 tok/s	36.0 tok/s	1.00x	N/A
Doubly-linked list	DFlash	~1.9K tok	116.6 tok/s	127.3 tok/s	3.24x	44.5% / 84.9%
Prompt processing	Baseline	~24K tok	1021.3 tok/s	1021.3 tok/s	1.00x	N/A
Prompt processing	DFlash	~24K tok	954.5 tok/s	954.9 tok/s	0.93x	N/A
Multi-turn coding	Baseline	~12K tok	34.8 tok/s	34.8 tok/s	1.00x	N/A
Multi-turn coding	DFlash	~12K tok	60.6 tok/s	64.1 tok/s	1.74x	24.4% / 72.3%

Acceptance: accepted to proposed draft tokens / accepted draft tokens to final generated tokens

KV Cache Quantization

K and V cache types are set independently with --cache-type-k and --cache-type-v. For benchmark details, see KV Cache Quantization Benchmarks for Long Context.

Preset Ladder

K / V	% of bf16 size	99.9% precision	What it is for
bf16 / bf16	100	100.0%	Preserving full quality
q8_0 / q8_0	53.1	94.6%	Compression with minimal losses
q8_0 / q5_1	45.3	94.2%	Great precision/size at high end
q8_0 / q5_0	43.8	93.7%	If q8_0 / q5_1 doesn't fit just a bit
q5_0 / q5_0	34.4	92.7%	Good precision/size, relatively safe
q5_0 / q4_1	32.8	92.6%	Best default if VRAM-constrained
q5_0 / q4_0	31.3	91.4%	If q5_0 / q4_1 doesn't fit just a bit
q4_0 / q4_0	28.1	89.8%	Memory saving with precision loss
q4_0 / turbo3_tcq	24.2	84.9%	Smaller than q4, cleaner than turbo3_tcq
turbo3_tcq / turbo3_tcq	20.3	81.6%	Viable as extreme compression
turbo2_tcq / turbo2_tcq	14.1	54.4%	Last resort: not for code and tool calls

99.9% precision = 100 · exp(−(quantKLD − bf16KLD)) at the 99.9% KL-divergence tail.

Type Reference

Type	Origin	bpv	Diff vs bf16	Notes
q8_0	upstream	8.5	1.88×	High-fidelity K or V
q5_1	upstream	6	2.67×	Conservative, might be better for V than q5_0
q5_0	upstream	5.5	2.91×	Strong K type for VRAM constrained configs
q4_1	upstream	5	3.2×	Smaller than q5_0, but weaker in the tail. Prefer q5_0 for K
q4_0	upstream	4.5	3.56×	Default high compression type, decent at its size
turbo4	fork	4.125	3.88×	Barely smaller than q4_0, slower, worse tail
turbo3_tcq	fork	3.25	4.92×	Viable compact mode, 82% precision at KLD 99.9%. CUDA-only
turbo3	fork	3.125	5.12×	Weaker than turbo3_tcq. Use only when TCQ is unavailable
turbo2_tcq	fork	2.25	7.11×	Last resort, 54% precision at KLD 99.9%. CUDA-only
turbo2	fork	2.125	7.53×	Extreme quality risk. Use only when TCQ is unavailable

Installation

Plug-and-Play Setups

Prebuilt (Windows)

Download the release archive for your CUDA version (12.4 or 13.1) from the releases page. Extract it. The server binary is llama-server.exe. Don't forget to download a separate archive with CUDA libraries and place it in the same folder!

Building from source with -DGGML_NATIVE=ON may result in a tiny bit better performance, so it might still be a good idea to do that if/when you decide to use this fork long-term.

CUDA Build

# Linux (GCC + CUDA)
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON \
  -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Windows (MSVC + CUDA)
cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON ^
  -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON ^
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release --parallel

# macOS (Metal)
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

GGML_CUDA_FA_ALL_QUANTS=ON is required for TurboQuant and TCQ cache types. Add -DCMAKE_CUDA_ARCHITECTURES=86 for RTX 3090, or -DCMAKE_CUDA_ARCHITECTURES=89 for RTX 4090, if cross-compiling or building in CI without a GPU.

Other Backends

Bee inherits llama.cpp backend support, including Metal, HIP, Vulkan, SYCL, BLAS, CANN, MUSA, OpenVINO, OpenCL, and RPC. Use the upstream-style build docs in docs/build.md and backend-specific pages under docs/backend.

Common Commands

Local CLI

llama-cli -m model.gguf
llama-cli -m model.gguf -cnv --chat-template chatml
llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p "Request: schedule a call at 8pm; Command:"

OpenAI-Compatible Server

llama-server -m model.gguf --port 8080
llama-server -m model.gguf -c 16384 -np 4
llama-server -m model.gguf -md draft.gguf

DFlash And TurboQuant Together

llama-server -m target.gguf --spec-type dflash \
  --spec-draft-model drafter.gguf \
  --spec-draft-ngl all \
  --flash-attn on --cache-type-k turbo4 --cache-type-v turbo3_tcq

Documentation

Contributing

Keep PRs small and scoped. Run the narrowest relevant tests or benchmarks before opening a PR, and include the exact commands. For fork-specific speculative decoding, DFlash, TurboQuant, or reasoning-loop changes, update the corresponding docs when behavior or args change.

Read CONTRIBUTING.md for inherited llama.cpp contribution conventions and this fork's AI usage policy.

Dependencies

yhirose/cpp-httplib - single-header HTTP server used by llama-server - MIT
stb-image - single-header image decoder used by multimodal code - public domain
nlohmann/json - single-header JSON library - MIT
miniaudio.h - single-header audio decoder - public domain
subprocess.h - process launching helper - public domain
Snowflake ArcticInference - suffix tree and int32 map used in speculative decoding (common/suffix-tree.*, common/int32-map.h) - Apache-2.0
Intel OpenVINO - frontend header used in OpenVINO backend (ggml/src/ggml-openvino/openvino/frontend.h) - Apache-2.0
Intel SYCL/oneAPI - SYCL backend (ggml/src/ggml-sycl/) - Apache-2.0 WITH LLVM-exception

See the licenses/ directory for full license texts.

Name		Name	Last commit message	Last commit date
Latest commit History 9,459 Commits
.devops		.devops
.gemini		.gemini
.github		.github
.pi/gg		.pi/gg
benches		benches
ci		ci
cmake		cmake
codebooks		codebooks
common		common
docs		docs
examples		examples
ggml		ggml
gguf-py		gguf-py
grammars		grammars
include		include
licenses		licenses
media		media
models		models
notes		notes
pocs		pocs
requirements		requirements
scripts		scripts
src		src
tests		tests
tools		tools
vendor		vendor
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.dockerignore		.dockerignore
.ecrc		.ecrc
.editorconfig		.editorconfig
.flake8		.flake8
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
AGENTS.md		AGENTS.md
AUTHORS		AUTHORS
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
CMakeLists.txt		CMakeLists.txt
CMakePresets.json		CMakePresets.json
CODEOWNERS		CODEOWNERS
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SD-080-benchmark-notes.txt		SD-080-benchmark-notes.txt
SECURITY.md		SECURITY.md
apply_turbo_cuda_v2.py		apply_turbo_cuda_v2.py
beellama.jpg		beellama.jpg
build-xcframework.sh		build-xcframework.sh
convert_hf_to_gguf.py		convert_hf_to_gguf.py
convert_hf_to_gguf_update.py		convert_hf_to_gguf_update.py
convert_llama_ggml_to_gguf.py		convert_llama_ggml_to_gguf.py
convert_lora_to_gguf.py		convert_lora_to_gguf.py
flake.lock		flake.lock
flake.nix		flake.nix
mypy.ini		mypy.ini
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
pyrightconfig.json		pyrightconfig.json
requirements.txt		requirements.txt
ty.toml		ty.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Anbeeld's BeeLlama.cpp

Plug-and-Play Setups

Fork Features

DFlash Speedup

Qwen 3.6 27B

Gemma 4 31B

KV Cache Quantization

Preset Ladder

Type Reference

Installation

Plug-and-Play Setups

Prebuilt (Windows)

CUDA Build

Other Backends

Common Commands

Local CLI

OpenAI-Compatible Server

DFlash And TurboQuant Together

Documentation

Contributing

Dependencies

About

Uh oh!

Releases 4

Sponsor this project

Uh oh!

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Anbeeld's BeeLlama.cpp

Plug-and-Play Setups

Fork Features

DFlash Speedup

Qwen 3.6 27B

Gemma 4 31B

KV Cache Quantization

Preset Ladder

Type Reference

Installation

Plug-and-Play Setups

Prebuilt (Windows)

CUDA Build

Other Backends

Common Commands

Local CLI

OpenAI-Compatible Server

DFlash And TurboQuant Together

Documentation

Contributing

Dependencies

About

Topics

Resources

License

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 4

Sponsor this project

Uh oh!

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages