Lemmit.Online bot

The original was posted on /r/singularity by /u/UnknownEssence on 2024-09-24 18:19:24+00:00.

GEMINI 1.5 PRO:

Capability	Benchmark	May 2024	Sep 2024
General	MMLU-Pro	69.0%	75.8%
Code	Natural2Code	82.6%	85.4%
Math	MATH	67.7%	86.5%
	HiddenMath	28.0%	52.0%
Reasoning	GPQA (diamond)	46.0%	59.1%
Multilingual	WMT23	75.3	75.1
Long Context	MRCR (1M)	70.5%	82.6%
Image	MMMU	62.2%	65.9%
	Vibe-Eval (Reka)	48.9%	53.9%
	MathVista	63.9%	68.1%
Audio	FLEURS (55 lang, WER)	6.5%	6.7%
Video	Video-MME	77.9%	78.6%
Safety	XSTest	88.4%	98.8%

GEMINI 1.5 FLASH:

Capability	Benchmark	May 2024	Sep 2024
General	MMLU-Pro	59.1%	67.3%
Code	Natural2Code	77.2%	79.8%
Math	MATH	54.9%	77.9%
	HiddenMath	20.3%	47.2%
Reasoning	GPQA (diamond)	41.4%	51.0%
Multilingual	WMT23	74.1	73.9
Long Context	MRCR (1M)	70.1%	71.9%
Image	MMMU	56.1%	62.3%
	Vibe-Eval (Reka)	44.8%	48.9%
	MathVista	58.4%	65.8%
Audio	FLEURS (55 lang, WER)	9.8%	9.6%
Video	Video-MME	74.7%	76.1%
Safety	XSTest	86.9%	97.0%

Benchmark performance of today's new Gemini model
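
For anyone who wants the May-to-September change per benchmark rather than eyeballing the two tables above, here is a minimal Python sketch. The numbers are copied from the tables; the names `PRO`, `FLASH`, `deltas`, and `largest_gain` are made up for illustration and are not part of any official tooling.

```python
# Scores copied from the tables above as (May 2024, Sep 2024).
# WMT23 is a plain score; every other row is a percentage.
PRO = {
    "MMLU-Pro": (69.0, 75.8),
    "Natural2Code": (82.6, 85.4),
    "MATH": (67.7, 86.5),
    "HiddenMath": (28.0, 52.0),
    "GPQA (diamond)": (46.0, 59.1),
    "WMT23": (75.3, 75.1),
    "MRCR (1M)": (70.5, 82.6),
    "MMMU": (62.2, 65.9),
    "Vibe-Eval (Reka)": (48.9, 53.9),
    "MathVista": (63.9, 68.1),
    "FLEURS (55 lang)": (6.5, 6.7),
    "Video-MME": (77.9, 78.6),
    "XSTest": (88.4, 98.8),
}

FLASH = {
    "MMLU-Pro": (59.1, 67.3),
    "Natural2Code": (77.2, 79.8),
    "MATH": (54.9, 77.9),
    "HiddenMath": (20.3, 47.2),
    "GPQA (diamond)": (41.4, 51.0),
    "WMT23": (74.1, 73.9),
    "MRCR (1M)": (70.1, 71.9),
    "MMMU": (56.1, 62.3),
    "Vibe-Eval (Reka)": (44.8, 48.9),
    "MathVista": (58.4, 65.8),
    "FLEURS (55 lang)": (9.8, 9.6),
    "Video-MME": (74.7, 76.1),
    "XSTest": (86.9, 97.0),
}


def deltas(scores):
    """Return {benchmark: Sep minus May}, rounded to one decimal place."""
    return {name: round(sep - may, 1) for name, (may, sep) in scores.items()}


def largest_gain(scores):
    """Return the benchmark with the biggest raw increase from May to Sep."""
    changes = deltas(scores)
    return max(changes, key=changes.get)


if __name__ == "__main__":
    for label, scores in (("Pro", PRO), ("Flash", FLASH)):
        for name, change in deltas(scores).items():
            print(f"{label:5} {name:18} {change:+.1f}")
        print(f"{label:5} largest gain: {largest_gain(scores)}")
```

The output is a raw difference in points, not a relative improvement. FLEURS is usually reported as word error rate, so for that row a lower number would be the better one and a positive change is not a gain.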