Environment
- macOS Darwin 24.5.0, Apple M4 Pro (24GB unified)
- whichllm 0.5.7 (latest PyPI)
- Python 3.14
Three broken links in the data pipeline
1. Artificial Analysis scraper: __NEXT_DATA__ payload not found
The AA site no longer embeds __NEXT_DATA__ in the HTML; it has moved to client-side rendering. Every invocation logs:
AA Index fetch failed, will use fallback: __NEXT_DATA__ payload not found
Confirmed by fetching the page directly — no __NEXT_DATA__ script tag in the response HTML.
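A check along these lines reproduces the finding. This is a minimal sketch, assuming the page HTML has already been fetched into a string. It relies only on the standard Next.js convention of a `<script id="__NEXT_DATA__" type="application/json">` tag. `extract_next_data` is a hypothetical helper name, not a function from whichllm.

```python
import json
import re
from typing import Optional

# Next.js server-rendered pages embed their payload in a script tag with this id.
_NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid=["\']__NEXT_DATA__["\'][^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> Optional[dict]:
    """Return the parsed __NEXT_DATA__ payload, or None if the tag is absent."""
    match = _NEXT_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


if __name__ == "__main__":
    # Illustrative input only; a client-rendered page has no such tag.
    print(extract_next_data("<html><body><div id='root'></div></body></html>"))
```

On the current AA page this kind of check comes back empty, which matches the "payload not found" message above.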
2. HuggingFace datasets API: 429 rate limits
Both the Open LLM Leaderboard and Chatbot Arena ELO endpoints return 429:
Leaderboard fetch failed: Client error '429 Too Many Requests' for url: 'https://datasets-server.huggingface.co/rows?dataset=open-llm-leaderboard%2Fcontents...'
Arena fetch failed, using fallback: Client error '429 Too Many Requests' for url: 'https://datasets-server.huggingface.co/rows?dataset=mathewhe%2Fchatbot-arena-elo...'
No retry/backoff logic appears to be built in. After the first run caches stale data, --refresh hits the same 429s.
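For illustration, retry with exponential backoff and jitter could look roughly like this. It is a sketch, not whichllm's code: `RateLimited` is a hypothetical stand-in for whatever exception the HTTP client raises on a 429, and `call_with_backoff` and `backoff_delays` are made-up names. The sleep and jitter functions are injectable so the logic can be tested without waiting.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for the client's 429 error."""


def backoff_delays(retries: int = 3, base: float = 2.0) -> List[float]:
    """Base delay before each retry: 2s, 4s, 8s with the defaults."""
    return [base * 2 ** attempt for attempt in range(retries)]


def call_with_backoff(
    fetch: Callable[[], T],
    retries: int = 3,
    base: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> T:
    """Call fetch(), retrying on RateLimited up to `retries` times."""
    for delay in backoff_delays(retries, base):
        try:
            return fetch()
        except RateLimited:
            # Add up to 1s of random jitter so parallel clients do not retry in lockstep.
            sleep(delay + jitter())
    # Final attempt: if this also fails, the exception propagates to the caller,
    # which can then fall back to cached data as it does today.
    return fetch()
```

With the defaults this makes at most four attempts (one initial call plus three retries). If the 429 response carries a `Retry-After` header, honoring it instead of the computed delay would probably be friendlier to the API.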
3. Community GGUF model IDs don't match benchmark entries (the critical one)
This is the most impactful bug. The cached benchmarks have scores keyed by official model IDs:
"Qwen/Qwen3.6-27B": 83.5
"Qwen/Qwen3.5-397B-A17B": 74.4
But on HuggingFace, the GGUF quantizations are uploaded by community members:
unsloth/Qwen3.6-27B-GGUF (1.5M downloads)
unsloth/Qwen3.6-35B-A3B-GGUF (2.0M downloads)
bartowski/Qwen_Qwen3.6-35B-A3B-GGUF
The ranker can't match unsloth/Qwen3.6-27B-GGUF → Qwen/Qwen3.6-27B, so these models get zero benchmark scores and are either excluded or ranked at the bottom.
Result: On a 24GB M4 Pro, the tool ranks Qwen/Qwen3-8B (score 63.1) as #1 while completely missing Qwen3.6-27B (83.5 on benchmarks, 1.5M downloads) because the GGUF is from unsloth/, not Qwen/.
The README example shows Qwen3.6-27B at score 92.8 for RTX 4090 — so this matching works in some cases but not others, likely dependent on whether the official org also uploaded GGUF variants.
Suggested fixes
- AA scraper: Switch to their API endpoint or use a different scraping strategy for the new client-rendered site.
- HF 429s: Add exponential backoff with jitter (3 retries, 2s/4s/8s base).
- ID matching: Strip org prefix and -GGUF/-gguf suffix, then fuzzy match against benchmark keys. unsloth/Qwen3.6-27B-GGUF → Qwen3.6-27B → match Qwen/Qwen3.6-27B. Also check cardData.base_model for community uploads.
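A rough sketch of the ID-matching idea, under some assumptions: the function names are hypothetical, `base_model` is assumed to arrive as a single string (on the Hub it can also be a list), and the `Org_Model` handling is inferred from the bartowski repo name above. Instead of a true fuzzy match, this uses an exact match on the normalized name, which avoids accidentally pairing near-identical names such as Qwen3.5 and Qwen3.6 variants.

```python
import re
from typing import Iterable, Optional


def normalize_repo_id(repo_id: str) -> str:
    """Drop the org prefix and a trailing -GGUF/_GGUF/.GGUF suffix, then lowercase."""
    name = repo_id.split("/", 1)[-1]
    name = re.sub(r"[-_.]gguf$", "", name, flags=re.IGNORECASE)
    return name.lower()


def match_benchmark_key(
    repo_id: str,
    benchmark_keys: Iterable[str],
    base_model: Optional[str] = None,
) -> Optional[str]:
    """Return the benchmark key a community GGUF repo corresponds to, if any."""
    keys = list(benchmark_keys)
    # 1. Trust explicit model-card metadata when it names a known benchmark entry.
    if base_model is not None and base_model in keys:
        return base_model
    # 2. Compare normalized names (if two orgs share a name, the later key wins).
    by_name = {normalize_repo_id(key): key for key in keys}
    target = normalize_repo_id(repo_id)
    if target in by_name:
        return by_name[target]
    # 3. Some uploaders prefix the original org, e.g. "Qwen_Qwen3.6-35B-A3B".
    if "_" in target:
        return by_name.get(target.split("_", 1)[1])
    return None


if __name__ == "__main__":
    keys = ["Qwen/Qwen3.6-27B", "Qwen/Qwen3.5-397B-A17B"]
    print(match_benchmark_key("unsloth/Qwen3.6-27B-GGUF", keys))  # Qwen/Qwen3.6-27B
```

If exact matching on the normalized name proves too strict, a similarity cutoff could be layered on top, but it would need to be tight enough to keep different model versions apart.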
Happy to PR any of these if welcome.