This is an automated archive made by the Lemmit Bot.

The original was posted on /r/singularity by /u/UnknownEssence on 2024-09-24 18:19:24+00:00.


GEMINI 1.5 PRO:

| Capability | Benchmark | May 2024 | Sep 2024 |
|---|---|---|---|
| General | MMLU-Pro | 69.0% | 75.8% |
| Code | Natural2Code | 82.6% | 85.4% |
| Math | MATH | 67.7% | 86.5% |
| | HiddenMath | 28.0% | 52.0% |
| Reasoning | GPQA (diamond) | 46.0% | 59.1% |
| Multilingual | WMT23 | 75.3 | 75.1 |
| Long Context | MRCR (1M) | 70.5% | 82.6% |
| Image | MMMU | 62.2% | 65.9% |
| | Vibe-Eval (Reka) | 48.9% | 53.9% |
| | MathVista | 63.9% | 68.1% |
| Audio | FLEURS (55 lang) | 6.5% | 6.7% |
| Video | Video-MME | 77.9% | 78.6% |
| Safety | XSTest | 88.4% | 98.8% |

GEMINI 1.5 FLASH:

| Capability | Benchmark | May 2024 | Sep 2024 |
|---|---|---|---|
| General | MMLU-Pro | 59.1% | 67.3% |
| Code | Natural2Code | 77.2% | 79.8% |
| Math | MATH | 54.9% | 77.9% |
| | HiddenMath | 20.3% | 47.2% |
| Reasoning | GPQA (diamond) | 41.4% | 51.0% |
| Multilingual | WMT23 | 74.1 | 73.9 |
| Long Context | MRCR (1M) | 70.1% | 71.9% |
| Image | MMMU | 56.1% | 62.3% |
| | Vibe-Eval (Reka) | 44.8% | 48.9% |
| | MathVista | 58.4% | 65.8% |
| Audio | FLEURS (55 lang) | 9.8% | 9.6% |
| Video | Video-MME | 74.7% | 76.1% |
| Safety | XSTest | 86.9% | 97.0% |