Leaderboard
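An aggregate of contamination-resistant benchmarks (IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro) for open-weight language models, taken from the final snapshot of Hugging Face's Open LLM Leaderboard. Avg. covers all six benchmarks; only IFEval, MATH, and GPQA are shown as columns, so it cannot be recomputed from the table alone.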
| # | Model | Avg. | IFEval | MATH | GPQA |
|---|---|---|---|---|---|
| 1 | MaziyarPanahi/calme-3.2-instruct-78b | 52.1 | 80.6 | 40.3 | 20.4 |
| 2 | MaziyarPanahi/calme-3.1-instruct-78b | 51.3 | 81.4 | 39.3 | 19.5 |
| 3 | dfurman/CalmeRys-78B-Orpo-v0.1 | 51.2 | 81.6 | 40.6 | 20.0 |
| 4 | MaziyarPanahi/calme-2.4-rys-78b | 50.8 | 80.1 | 40.7 | 20.4 |
| 5 | huihui-ai/Qwen2.5-72B-Instruct-abliterated | 48.1 | 85.9 | 60.1 | 19.4 |
| 6 | Qwen/Qwen2.5-72B-Instruct (Qwen · Alibaba) | 48.0 | 86.4 | 59.8 | 16.7 |
| 7 | MaziyarPanahi/calme-2.1-qwen2.5-72b | 47.9 | 86.6 | 59.1 | 15.1 |
| 8 | newsbang/Homer-v1.0-Qwen2.5-72B | 47.5 | 76.3 | 49.0 | 22.1 |
| 9 | Saxo/Linkbricks-Horizon-AI-Avengers-V1-32B | 47.3 | 79.7 | 60.3 | 15.0 |
| 10 | MaziyarPanahi/calme-2.2-qwen2.5-72b | 47.2 | 84.8 | 58.9 | 14.5 |
| 11 | fluently-lm/FluentlyLM-Prinum | 47.2 | 80.9 | 54.0 | 18.2 |
| 12 | JungZoona/T3Q-Qwen2.5-14B-Instruct-1M-e3 | 47.1 | 73.2 | 28.6 | 22.3 |
| 13 | JungZoona/T3Q-qwen2.5-14b-v1.0-e3 | 47.1 | 73.2 | 28.6 | 22.3 |
| 14 | zetasepic/Qwen2.5-32B-Instruct-abliterated-v2 | 46.9 | 83.3 | 59.5 | 15.7 |
| 15 | rubenroy/Gilgamesh-72B | 46.8 | 84.9 | 43.8 | 19.2 |
| 16 | Sakalti/ultiima-72B | 46.8 | 71.4 | 53.5 | 21.9 |
| 17 | CombinHorizon/zetasepic-abliteratedV2-Qwen2.5-32B-Inst-BaseMerge-TIES | 46.8 | 83.3 | 58.5 | 15.7 |
| 18 | maldv/Awqward2.5-32B-Instruct | 46.7 | 82.5 | 62.3 | 12.1 |
| 19 | raphgg/test-2.5-72B | 46.7 | 84.4 | 41.1 | 18.6 |
| 20 | shuttleai/shuttle-3 | 46.7 | 81.5 | 46.0 | 21.6 |
| 21 | Qwen/Qwen2.5-32B-Instruct (Qwen · Alibaba) | 46.6 | 83.5 | 62.5 | 11.7 |
| 22 | mistralai/Mistral-Large-Instruct-2411 (Mistral AI) | 46.5 | 84.0 | 49.5 | 24.9 |
| 23 | rombodawg/Rombos-LLM-V2.5-Qwen-72b | 46.5 | 71.6 | 54.2 | 19.8 |
| 24 | Saxo/Linkbricks-Horizon-AI-Avengers-V3-32B | 46.4 | 82.5 | 61.8 | 11.7 |
| 25 | zetasepic/Qwen2.5-72B-Instruct-abliterated | 46.3 | 71.5 | 52.4 | 20.9 |
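
The table is ranked by Avg., but the order can look quite different under a single benchmark. The following is a minimal sketch, not part of the leaderboard itself, of how rows in this pipe-delimited layout could be parsed and re-ranked by one column. The function names `parse_markdown_table` and `rank_by` are hypothetical, and the sample uses three rows copied from the table above.

```python
from typing import Dict, List


def parse_markdown_table(text: str) -> List[Dict[str, str]]:
    """Parse a pipe-delimited markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for ln in lines[2:]:  # skip the header row and the |---| separator row
        cells = [cell.strip() for cell in ln.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def rank_by(rows: List[Dict[str, str]], column: str) -> List[Dict[str, str]]:
    """Return rows sorted by a numeric column, highest score first."""
    return sorted(rows, key=lambda row: float(row[column]), reverse=True)


# Three rows copied from the table above (ranks 1, 5, and 16 by Avg.).
SAMPLE = """
| # | Model | Avg. | IFEval | MATH | GPQA |
|---|---|---|---|---|---|
| 1 | MaziyarPanahi/calme-3.2-instruct-78b | 52.1 | 80.6 | 40.3 | 20.4 |
| 5 | huihui-ai/Qwen2.5-72B-Instruct-abliterated | 48.1 | 85.9 | 60.1 | 19.4 |
| 16 | Sakalti/ultiima-72B | 46.8 | 71.4 | 53.5 | 21.9 |
"""

rows = parse_markdown_table(SAMPLE)
by_math = [row["Model"] for row in rank_by(rows, "MATH")]
by_gpqa = [row["Model"] for row in rank_by(rows, "GPQA")]
```

In this sample the Avg. leader drops to last place when ranked by MATH, which is one reason to read the per-benchmark columns alongside the aggregate.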
Leaderboard data is cached from the Open LLM Leaderboard and refreshed every 12 hours. Scores are shown for reference; see the source for the full methodology.
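
The page does not say how the 12-hour refresh is implemented. One common pattern is a time-to-live cache that refetches only once the stored copy is older than the refresh interval. The sketch below assumes that pattern; `CachedSnapshot`, `REFRESH_SECONDS`, and the `fetch` callable are hypothetical names, not part of any Hugging Face API.

```python
import time
from typing import Any, Callable, Optional

# Assumed refresh interval, taken from the "every 12 hours" note above.
REFRESH_SECONDS = 12 * 60 * 60


class CachedSnapshot:
    """Hold one fetched value and refetch it once it is older than `ttl` seconds."""

    def __init__(
        self,
        fetch: Callable[[], Any],
        ttl: float = REFRESH_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch  # hypothetical loader, e.g. a function that downloads the scores
        self._ttl = ttl
        self._clock = clock  # injectable so the expiry logic can be tested without waiting
        self._value: Any = None
        self._fetched_at: Optional[float] = None

    def get(self) -> Any:
        """Return the cached value, refreshing it first if it has expired."""
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

Using `time.monotonic` as the default clock keeps the expiry check unaffected by system clock adjustments.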