Benchmark pipeline fails on macOS: AA scraper broken, HF 429s, community GGUF ID mismatch

## Environment
- macOS Darwin 24.5.0, Apple M4 Pro (24GB unified)
- whichllm 0.5.7 (latest PyPI)
- Python 3.14

## Three broken links in the data pipeline

### 1. Artificial Analysis scraper: `__NEXT_DATA__ payload not found`

The AA site no longer embeds `__NEXT_DATA__` in its HTML; the page has moved to client-side rendering. Every invocation logs:

```
AA Index fetch failed, will use fallback: __NEXT_DATA__ payload not found
```

Confirmed by fetching the page directly: the response HTML contains no `__NEXT_DATA__` script tag.
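For anyone who wants to reproduce the check, here is a minimal sketch of the kind of test I ran against the saved HTML. It is not whichllm's actual parser; `extract_next_data` is a hypothetical helper name, and the regex assumes the standard Next.js `<script id="__NEXT_DATA__">` tag.

```python
import re
from typing import Optional

# Server-rendered Next.js pages embed their data in a script tag with this id.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> Optional[str]:
    """Return the raw JSON text inside the __NEXT_DATA__ tag, or None if absent."""
    match = NEXT_DATA_RE.search(html)
    return match.group(1) if match else None
```

Running this on the HTML currently served by the AA page returns `None` for me, which matches the error above.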

### 2. HuggingFace datasets API: 429 rate limits

Both the Open LLM Leaderboard and Chatbot Arena ELO endpoints return 429:

```
Leaderboard fetch failed: Client error '429 Too Many Requests' for url: 'https://datasets-server.huggingface.co/rows?dataset=open-llm-leaderboard%2Fcontents...'
Arena fetch failed, using fallback: Client error '429 Too Many Requests' for url: 'https://datasets-server.huggingface.co/rows?dataset=mathewhe%2Fchatbot-arena-elo...'
```

No retry/backoff logic appears to be built in. After the first run caches stale data, `--refresh` hits the same 429s.
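A rough sketch of the retry wrapper that seems to be missing, assuming exponential delays with a little random jitter. `RateLimited`, `backoff_delays`, and `fetch_with_backoff` are hypothetical names for illustration; the log format suggests the real client raises an HTTP status error, so actual code would check for status 429 there instead.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the datasets server."""


def backoff_delays(retries: int = 3, base: float = 2.0, jitter: float = 0.5) -> Iterator[float]:
    """Yield base * 2**attempt seconds plus up to `jitter` seconds of random noise."""
    for attempt in range(retries):
        yield base * (2 ** attempt) + random.uniform(0, jitter)


def fetch_with_backoff(
    fetch: Callable[[], T],
    retries: int = 3,
    base: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch`, sleeping and retrying up to `retries` times on RateLimited."""
    for delay in backoff_delays(retries, base):
        try:
            return fetch()
        except RateLimited:
            sleep(delay)
    # Final attempt: if this is rate limited too, let the error propagate
    # so the caller can fall back to cached data.
    return fetch()
```

With the defaults this makes one initial attempt plus three retries, waiting roughly 2s, 4s, and 8s. The `sleep` parameter is injectable only so the wrapper can be tested without real waiting.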

### 3. Community GGUF model IDs don't match benchmark entries (the critical one)

This is the most impactful bug. The cached benchmarks have scores keyed by official model IDs:

```python
{
    "Qwen/Qwen3.6-27B": 83.5,
    "Qwen/Qwen3.5-397B-A17B": 74.4,
}
```

But on HuggingFace, the GGUF quantizations are uploaded by community members:

```
unsloth/Qwen3.6-27B-GGUF        (1.5M downloads)
unsloth/Qwen3.6-35B-A3B-GGUF    (2.0M downloads)
bartowski/Qwen_Qwen3.6-35B-A3B-GGUF
```

The ranker can't match `unsloth/Qwen3.6-27B-GGUF` → `Qwen/Qwen3.6-27B`, so these models get zero benchmark scores and are either excluded or ranked at the bottom.

**Result**: On a 24GB M4 Pro, the tool ranks `Qwen/Qwen3-8B` (score 63.1) as #1 while completely missing `Qwen3.6-27B` (83.5 on benchmarks, 1.5M downloads) because the GGUF is from `unsloth/`, not `Qwen/`.

The README example shows `Qwen3.6-27B` at score 92.8 for RTX 4090, so this matching works in some cases but not others, likely depending on whether the official org also uploaded GGUF variants.

## Suggested fixes

1. **AA scraper**: Switch to their API endpoint or use a different scraping strategy for the new client-rendered site.
2. **HF 429s**: Add exponential backoff with jitter (3 retries, with base delays of 2s/4s/8s).
3. **ID matching**: Strip org prefix and `-GGUF`/`-gguf` suffix, then fuzzy match against benchmark keys. `unsloth/Qwen3.6-27B-GGUF` → `Qwen3.6-27B` → match `Qwen/Qwen3.6-27B`. Also check `cardData.base_model` for community uploads.
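To make fix 3 concrete, here is a minimal sketch of the normalization. It is not whichllm's code: `normalize` and `match_benchmark` are hypothetical names, `BENCHMARKS` is just the two cached entries quoted above, and instead of true fuzzy matching it does an exact match on the normalized name plus a special case for the `Org_Model` naming seen in the `bartowski/` upload. The `base_model` argument stands in for whatever `cardData.base_model` returns, which I am assuming can be a string or a list of strings.

```python
import re
from typing import Dict, List, Optional, Union

# The two cached entries quoted earlier in this issue.
BENCHMARKS = {"Qwen/Qwen3.6-27B": 83.5, "Qwen/Qwen3.5-397B-A17B": 74.4}


def normalize(repo_id: str) -> str:
    """Drop the org prefix and a trailing -GGUF/_GGUF suffix, then lowercase."""
    name = repo_id.split("/", 1)[-1]
    name = re.sub(r"[-_]gguf$", "", name, flags=re.IGNORECASE)
    return name.lower()


def match_benchmark(
    repo_id: str,
    benchmarks: Dict[str, float],
    base_model: Union[str, List[str], None] = None,
) -> Optional[str]:
    """Return the benchmark key for a community GGUF repo, or None if no match."""
    # 1. Prefer the declared base model when the model card provides one.
    candidates = [base_model] if isinstance(base_model, str) else (base_model or [])
    for candidate in candidates:
        if candidate in benchmarks:
            return candidate

    # 2. Exact match on the normalized repo name.
    target = normalize(repo_id)
    by_name = {normalize(key): key for key in benchmarks}
    if target in by_name:
        return by_name[target]

    # 3. Handle names that repeat the org, e.g. "Qwen_Qwen3.6-27B-GGUF".
    for norm, key in by_name.items():
        org = key.split("/", 1)[0].lower()
        if target == f"{org}_{norm}":
            return key
    return None
```

For example, `match_benchmark("unsloth/Qwen3.6-27B-GGUF", BENCHMARKS)` resolves to `"Qwen/Qwen3.6-27B"`, while a repo with no benchmark entry returns `None` instead of being matched to a similarly named model. A real fuzzy matcher would need a similarity threshold to avoid confusing sizes like 27B and 35B, which is why this sketch stays with exact matches.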

Happy to PR any of these if welcome.
