Benchmark pipeline fails on macOS: AA scraper broken, HF 429s, community GGUF ID mismatch #83

@robertcprice

Environment

  • macOS Darwin 24.5.0, Apple M4 Pro (24GB unified)
  • whichllm 0.5.7 (latest PyPI)
  • Python 3.14

Three broken links in the data pipeline

1. Artificial Analysis scraper: __NEXT_DATA__ payload not found

The AA site no longer embeds __NEXT_DATA__ in the HTML; it appears to have moved to client-side rendering. Every invocation logs:

AA Index fetch failed, will use fallback: __NEXT_DATA__ payload not found

Confirmed by fetching the page directly — no __NEXT_DATA__ script tag in the response HTML.
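For anyone who wants to reproduce the check, here is a minimal sketch of how the missing tag can be detected in a saved copy of the page HTML. It is not whichllm's actual scraper code; the function name `extract_next_data` and the regex are my own assumptions about how such a check could look.

```python
import json
import re

# Server-rendered Next.js pages embed their props in a script tag with this id.
_NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_next_data(html: str):
    """Return the parsed __NEXT_DATA__ payload, or None if the tag is absent.

    Hypothetical helper for illustration only.
    """
    match = _NEXT_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


if __name__ == "__main__":
    client_rendered = '<html><body><div id="root"></div></body></html>'
    print(extract_next_data(client_rendered))
```

Returning `None` instead of raising would let the caller decide whether to fall back to cached data or try another source.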

2. HuggingFace datasets API: 429 rate limits

Both the Open LLM Leaderboard and Chatbot Arena ELO endpoints return 429:

Leaderboard fetch failed: Client error '429 Too Many Requests' for url: 'https://datasets-server.huggingface.co/rows?dataset=open-llm-leaderboard%2Fcontents...'
Arena fetch failed, using fallback: Client error '429 Too Many Requests' for url: 'https://datasets-server.huggingface.co/rows?dataset=mathewhe%2Fchatbot-arena-elo...'

No retry/backoff logic appears to be built in. After the first run caches stale data, --refresh hits the same 429s.
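A rough sketch of what retry with backoff could look like, using the 3 retries and 2s/4s/8s delays proposed under "Suggested fixes". Everything here is an assumption for illustration: `RateLimited` stands in for whatever exception the HTTP client raises on a 429, and `fetch_with_backoff` is a hypothetical wrapper, not an existing whichllm function. It does not honor a `Retry-After` header, which a real fix probably should.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the client's HTTP 429 error."""


def fetch_with_backoff(fetch, retries=3, base_delay=2.0,
                       sleep=time.sleep, rng=random.random):
    """Call fetch(); on RateLimited, wait and retry up to `retries` times.

    The wait doubles each attempt (2s, 4s, 8s with the defaults) and up to
    50% random jitter is added so parallel clients do not retry in lockstep.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except RateLimited:
            if attempt == retries:
                raise  # out of retries, let the caller use its fallback
            delay = base_delay * (2 ** attempt)
            sleep(delay + rng() * delay * 0.5)
```

`sleep` and `rng` are injectable so the behaviour can be unit-tested without real waiting.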

3. Community GGUF model IDs don't match benchmark entries (the critical one)

This is the most impactful bug. The cached benchmarks have scores keyed by official model IDs:

"Qwen/Qwen3.6-27B": 83.5
"Qwen/Qwen3.5-397B-A17B": 74.4

But on HuggingFace, the GGUF quantizations are uploaded by community members:

unsloth/Qwen3.6-27B-GGUF        (1.5M downloads)
unsloth/Qwen3.6-35B-A3B-GGUF    (2.0M downloads)
bartowski/Qwen_Qwen3.6-35B-A3B-GGUF

The ranker can't match unsloth/Qwen3.6-27B-GGUF to Qwen/Qwen3.6-27B, so these models get zero benchmark scores and are either excluded or ranked at the bottom.

Result: On a 24GB M4 Pro, the tool ranks Qwen/Qwen3-8B (score 63.1) as #1 while completely missing Qwen3.6-27B (83.5 on benchmarks, 1.5M downloads) because the GGUF is from unsloth/, not Qwen/.

The README example shows Qwen3.6-27B at score 92.8 for RTX 4090 — so this matching works in some cases but not others, likely dependent on whether the official org also uploaded GGUF variants.

Suggested fixes

  1. AA scraper: Switch to their API endpoint or use a different scraping strategy for the new client-rendered site.
  2. HF 429s: Add exponential backoff with jitter (3 retries, 2s/4s/8s base).
  3. ID matching: Strip org prefix and -GGUF/-gguf suffix, then fuzzy match against benchmark keys. unsloth/Qwen3.6-27B-GGUF → Qwen3.6-27B → match Qwen/Qwen3.6-27B. Also check cardData.base_model for community uploads.
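To make fix 3 concrete, here is a minimal sketch of the normalization step. The names `normalize` and `match_benchmark` are hypothetical, and it uses exact matching after normalization rather than true fuzzy matching, which is a simplification. It assumes `base_model` from the card metadata may be either a single string or a list, and it also handles the bartowski-style `Org_Model` naming seen above.

```python
import re

# Trailing "-GGUF" / "_gguf" / ".GGUF" on a repo name.
_GGUF_SUFFIX = re.compile(r"[-_.]gguf$", re.IGNORECASE)


def normalize(repo_id: str) -> str:
    """Drop the org prefix and GGUF suffix, then lowercase."""
    name = repo_id.split("/", 1)[-1]
    return _GGUF_SUFFIX.sub("", name).lower()


def match_benchmark(repo_id, benchmark_keys, base_model=None):
    """Return the benchmark key for a community repo, or None if no match."""
    # 1. Prefer the declared base model when the card provides one.
    if isinstance(base_model, str):
        base_model = [base_model]
    for candidate in base_model or []:
        if candidate in benchmark_keys:
            return candidate

    # 2. Compare normalized names.
    by_name = {normalize(key): key for key in benchmark_keys}
    name = normalize(repo_id)
    if name in by_name:
        return by_name[name]

    # 3. Handle "Org_Model" repo names, e.g. "Qwen_Qwen3.6-27B-GGUF".
    for norm, key in by_name.items():
        org = key.split("/", 1)[0].lower()
        if name == f"{org}_{norm}":
            return key
    return None


if __name__ == "__main__":
    scores = {"Qwen/Qwen3.6-27B": 83.5, "Qwen/Qwen3.5-397B-A17B": 74.4}
    print(match_benchmark("unsloth/Qwen3.6-27B-GGUF", scores))
```

Checking `base_model` first seems safer than name matching alone, since renamed or merged community uploads would otherwise inherit scores from an unrelated model with a similar name.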

Happy to PR any of these if welcome.
