Covering the biggest news of the century - the arrival of smarter-than-human AI. The author of Simple Bench, exposing the remaining human-LLM reasoning gap.
Solo Developer of LM Council: lmcouncil.ai - Code - INSIDER15
Join me at AI Insiders, with exclusive videos and a 1000+ network of AI enthusiasts and professionals: www.patreon.com/AIExplained
AI Explained
Gemini 3 is the real deal, and I had the video to showcase just that. Then, at the end of a 15-hour day, 32GB of RAM just couldn't handle the tabs I was showing on screen (plus Antigravity), and the recording software was too poor to auto-back-up the recording. So I will return on the morrow, with the more reliable OBS Studio - and the full Gemini 3 Pro/Deep Think breakdown. Suffice it to say, Google has the lead in AI, not just on SimpleBench, and now that the giant is dancing it simply may not stop.
3 months ago | 1,837
So, o1 is good. Not ‘human-level reasoner’ good, or ‘threat to humanity’ good, but better than I thought ‘Strawberry’ would be. o1-preview correctly answers around half of the SimpleBench questions, noticeably outstripping Claude 3.5 Sonnet - it’s a step-change improvement, not an incremental one. A few caveats and comments, though.
1) It frequently - and sometimes predictably - makes really obvious mistakes: not calculation oopsies, but truly non-human blind spots. ‘When the cup is turned upside-down, the dice will fall and land on the open end of the cup, which is now the top’, and ‘he will argue back against the Brigadier-General at the troop parade, as his silly behavior in first grade indicates a history of speaking up against authority figures’. In some domains, these mistakes are routine, and pretty amusing.
2) Its performance on SimpleBench is a far cry from the ≈80% on GPQA, an ostensibly far, far 'harder' benchmark. o1-preview can simultaneously be said to have an incredible ceiling of performance (Gold on the Math Olympiad in certain settings!) and also a dramatically low floor, far lower than even an average-intelligence human.
3) Why no fixed percent score? It’s coming! But quite a few of the model’s outputs fail to complete, even after 10 minutes, its temperature is strangely fixed at 1 (just for now, hopefully?), and I was intending to finally establish a majority-vote score for all models, for true apples-to-apples comparison. My dream is to have this done by the end of September. I didn’t want you guys to wait until then for something, though.
4) A video tonight? Alas not, as I am currently abroad and it is well past midnight here. The Senior FAANG ML Engineer who oversees the runs and I did manage to bench o1-preview 3 times (o1-mini scores terribly, btw, despite the charts you see for AIME), and I analyzed every question/output, read the o1 system card - wowsers - and every OpenAI post, but am truly too exhausted for a full video, which would take me to like 5am.
So, for now, enjoy the fruit. It tastes better than I thought.
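For anyone curious what the majority-vote scoring in point 3 would look like in practice, here is a minimal, hypothetical sketch: run the benchmark several times, then count a question as correct only if the most common answer across runs matches the key. The function and data names are illustrative, not SimpleBench's actual harness.

```python
# Hypothetical majority-vote scorer: a question counts as correct only if
# the modal answer across repeated benchmark runs matches the answer key.
from collections import Counter

def majority_vote_score(runs, answer_key):
    """runs: list of dicts (one per run) mapping question id -> chosen answer.
    answer_key: dict mapping question id -> correct answer.
    Returns the fraction of questions whose majority answer is correct."""
    correct = 0
    for qid, truth in answer_key.items():
        # Tally this question's answers across all runs that produced one
        votes = Counter(run[qid] for run in runs if qid in run)
        if votes and votes.most_common(1)[0][0] == truth:
            correct += 1
    return correct / len(answer_key)

# Example: three runs over two questions
runs = [{"q1": "B", "q2": "C"}, {"q1": "B", "q2": "D"}, {"q1": "A", "q2": "C"}]
print(majority_vote_score(runs, {"q1": "B", "q2": "C"}))  # 1.0
```

Three runs per model, as described above, is the minimum that makes a majority meaningful; it also smooths out the fixed temperature-1 sampling noise.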
1 year ago | 1,908
Gemini 1.0 is here, deep diving as we speak! 90% MMLU, not bad...
2 years ago | 790
Sam Altman has left the building, fired for 'lying'
www.theguardian.com/technology/2023/nov/17/openai-…
2 years ago (edited) | 281