Leaderboard

Open LLM

Aggregate of contamination-resistant benchmarks (IFEval, BBH, MATH, GPQA, MuSR, MMLU-Pro) for open-weight language models. This is the final snapshot of Hugging Face's Open LLM Leaderboard. The table lists the average alongside three of the six benchmark scores.

Source: Open LLM Leaderboard · Updated Jun 17, 2026 · 25 models

#	Model	Avg.	IFEval	MATH	GPQA
1	MaziyarPanahi/calme-3.2-instruct-78b	52.1	80.6	40.3	20.4
2	MaziyarPanahi/calme-3.1-instruct-78b	51.3	81.4	39.3	19.5
3	dfurman/CalmeRys-78B-Orpo-v0.1	51.2	81.6	40.6	20
4	MaziyarPanahi/calme-2.4-rys-78b	50.8	80.1	40.7	20.4
5	huihui-ai/Qwen2.5-72B-Instruct-abliterated	48.1	85.9	60.1	19.4
6	Qwen/Qwen2.5-72B-Instruct (Qwen · Alibaba)	48.0	86.4	59.8	16.7
7	MaziyarPanahi/calme-2.1-qwen2.5-72b	47.9	86.6	59.1	15.1
8	newsbang/Homer-v1.0-Qwen2.5-72B	47.5	76.3	49	22.1
9	Saxo/Linkbricks-Horizon-AI-Avengers-V1-32B	47.3	79.7	60.3	15
10	MaziyarPanahi/calme-2.2-qwen2.5-72b	47.2	84.8	58.9	14.5
11	fluently-lm/FluentlyLM-Prinum	47.2	80.9	54	18.2
12	JungZoona/T3Q-Qwen2.5-14B-Instruct-1M-e3	47.1	73.2	28.6	22.3
13	JungZoona/T3Q-qwen2.5-14b-v1.0-e3	47.1	73.2	28.6	22.3
14	zetasepic/Qwen2.5-32B-Instruct-abliterated-v2	46.9	83.3	59.5	15.7
15	rubenroy/Gilgamesh-72B	46.8	84.9	43.8	19.2
16	Sakalti/ultiima-72B	46.8	71.4	53.5	21.9
17	CombinHorizon/zetasepic-abliteratedV2-Qwen2.5-32B-Inst-BaseMerge-TIES	46.8	83.3	58.5	15.7
18	maldv/Awqward2.5-32B-Instruct	46.7	82.5	62.3	12.1
19	raphgg/test-2.5-72B	46.7	84.4	41.1	18.6
20	shuttleai/shuttle-3	46.7	81.5	46	21.6
21	Qwen/Qwen2.5-32B-Instruct (Qwen · Alibaba)	46.6	83.5	62.5	11.7
22	mistralai/Mistral-Large-Instruct-2411 (Mistral AI)	46.5	84	49.5	24.9
23	rombodawg/Rombos-LLM-V2.5-Qwen-72b	46.5	71.6	54.2	19.8
24	Saxo/Linkbricks-Horizon-AI-Avengers-V3-32B	46.4	82.5	61.8	11.7
25	zetasepic/Qwen2.5-72B-Instruct-abliterated	46.3	71.5	52.4	20.9

Leaderboard data is cached from Open LLM Leaderboard and refreshed every 12 hours. Scores shown are for reference; see the source for full methodology.
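
The table is ordered by Avg., but readers may want to re-rank by a single benchmark. A minimal sketch of one way to do that, assuming the rows are available as tab-separated text in the column order shown above. The names `Row`, `parse_row`, and `SAMPLE` are hypothetical and not part of any leaderboard tooling, and the sample rows omit the vendor label that follows some model names.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """One leaderboard row, in the column order of the table header."""
    rank: int
    model: str
    avg: float
    ifeval: float
    math_score: float
    gpqa: float


def parse_row(line: str) -> Row:
    """Split a tab-separated row into typed fields."""
    rank, model, avg, ifeval, math_score, gpqa = line.rstrip("\n").split("\t")
    return Row(int(rank), model, float(avg), float(ifeval),
               float(math_score), float(gpqa))


# Three rows copied from the table (assumed tab-separated).
SAMPLE = (
    "1\tMaziyarPanahi/calme-3.2-instruct-78b\t52.1\t80.6\t40.3\t20.4\n"
    "5\thuihui-ai/Qwen2.5-72B-Instruct-abliterated\t48.1\t85.9\t60.1\t19.4\n"
    "22\tmistralai/Mistral-Large-Instruct-2411\t46.5\t84\t49.5\t24.9\n"
)

rows = [parse_row(line) for line in SAMPLE.splitlines()]

# Re-rank by GPQA instead of the overall average, highest score first.
by_gpqa = sorted(rows, key=lambda r: r.gpqa, reverse=True)

for r in by_gpqa:
    print(f"{r.gpqa:5.1f}  {r.model}")
```

Sorting on a different field (for example `r.ifeval` or `r.math_score`) works the same way; the overall rank is kept in `rank` so the original ordering can be recovered.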
