
Latent Space: The AI Engineer Podcast

swyx + Alessio

186 episodes

  • Latent Space: The AI Engineer Podcast

    ⚡️ Prism: OpenAI's LaTeX "Cursor for Scientists" — Kevin Weil & Victor Powell, OpenAI for Science

    27/01/2026 | 35 mins.
    From building Crixet in stealth (so stealthy Kevin had to hunt down Victor on Reddit to explore an acquisition) to launching Prism (https://openai.com/prism/) as OpenAI's free AI-native LaTeX editor, Kevin Weil (VP of OpenAI for Science) and Victor Powell (Product Lead on Prism) are embedding frontier reasoning models like GPT-5.2 directly into the scientific publishing workflow—turning weeks of LaTeX wrestling into minutes of natural language instruction, and accelerating the path from research breakthrough to published paper.

    We discuss:

    What Prism is: a free AI-native LaTeX editor with GPT-5.2 embedded directly into the workflow (no copy-pasting between ChatGPT and Overleaf, the AI has full context on all your files)

    The origin story: Kevin found Victor's stealth company Crixet on a Reddit forum, DMed him out of the blue, and brought the team into OpenAI to build the scientific collaboration layer for AI acceleration

    Live demo highlights: proofreading an introduction paragraph-by-paragraph, converting a whiteboard commutative diagram photo into TikZ LaTeX code (a minimal TikZ example of such a diagram appears after this list), generating 30 pages of general relativity lecture notes in seconds, and verifying complex symmetry equations in parallel chat sessions

    Why LaTeX is the bottleneck: scientists spend hours aligning diagrams, formatting equations, and managing references—time that should go to actual science, not typesetting

    The software engineering analogy: just like 2025 was the year AI moved from "early adopters only" to "you're falling behind if you're not using it" for coding, 2026 will be that year for science

    Why collaboration is built-in: unlimited collaborators for free (most LaTeX tools charge per seat), commenting, multi-line diff generation, and Monaco-based editor infrastructure

    The UI evolution thesis: today your document is front and center with AI on the side, but as models improve and trust increases, the primary interface becomes your conversation with the AI (the document becomes secondary verification)

    OpenAI for Science's mission: accelerate science by building frontier models and embedding them into scientific workflows (not just better models, but AI in the right places at the right time)

    The progression from SAT to open problems: two years ago GPT passed the SAT, then contest math, then graduate-level problems, then IMO Gold, and now it's solving open problems at the frontier of math, physics, and biology

    Why robotic labs are the next bottleneck: as AI gets better at reasoning over the full literature and designing experiments, the constraint shifts from "can we think of the right experiment" to "can we run 100 experiments in parallel while we sleep"

    The in silico acceleration unlock: nuclear fusion simulations, materials science, drug discovery—fields where you can run thousands of simulations in parallel, feed results back to the reasoning model, and iterate before touching the real world

    Self-acceleration and the automated researcher: Jakub Pachocki's public goal of an intern-level AI researcher by September 2026 (eight months away), and why that unlocks faster model improvement and faster science

    The vision: not to win Nobel Prizes ourselves, but for 100 scientists to win Nobel Prizes using our technology—and to compress 25 years of science into five by making every scientist faster
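
    The commutative-diagram demo above is easier to appreciate if you have seen what hand-written TikZ looks like. Below is a minimal sketch of the kind of LaTeX such a conversion produces, using the tikz-cd package; the objects and arrow labels are invented for illustration and are not actual Prism output.

```latex
\documentclass{article}
\usepackage{tikz-cd}  % commutative diagrams on top of TikZ
\begin{document}
% A commutative square: following f then h equals following g then k.
\[
\begin{tikzcd}
A \arrow[r, "f"] \arrow[d, "g"'] & B \arrow[d, "h"] \\
C \arrow[r, "k"']                & D
\end{tikzcd}
\]
\end{document}
```

    Even a two-by-two square like this is fiddly to write by hand, which is exactly the time sink the episode argues Prism removes.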



    Prism

    Try Prism: https://prism.openai.com (free, log in with your ChatGPT account)

    OpenAI for Science: https://openai.com/science

    Chapters

    00:00:00 Introduction: OpenAI Prism Launch and the AI for Science Mission
    00:00:42 Why LaTeX Needs AI: The Scientific Writing Bottleneck
    00:03:13 The Crixet Acquisition Story: From Reddit to OpenAI
    00:05:50 Live Demo: AI-Powered LaTeX Editing with GPT-5.2
    00:15:51 Collaboration Features and Notebooks: The Next Integration
    00:17:13 Engineering Challenges: Monaco, WebAssembly, and Backend Rendering
    00:18:19 The Future of Scientific UIs: From Document-First to AI-First
    00:21:02 AI for Science: From SAT Tests to Open Research Problems
    00:23:32 The Wet Lab Bottleneck: Robotic Labs and Experimental Acceleration
    00:33:08 Self-Acceleration and the Automated AI Researcher by September 2026
  • Latent Space: The AI Engineer Podcast

    Captaining IMO Gold, Deep Think, On-Policy RL, Feeling the AGI in Singapore — Yi Tay 2

    23/01/2026 | 1h 32 mins.
    From shipping Gemini Deep Think and IMO Gold to launching the Reasoning and AGI team in Singapore, Yi Tay has spent the last 18 months living through the full arc of Google DeepMind's pivot from architecture research to RL-driven reasoning—watching his team grow from a dozen researchers to 300+, training models that solve International Math Olympiad problems in a live competition, building the infrastructure to scale deep thinking across every domain, and driving Gemini to the top of the leaderboards in every category. Yi returns to dig into the inside story of the IMO effort and more!

    We discuss:

    Yi's path: Brain → Reka → Google DeepMind → Reasoning and AGI team Singapore, leading model training for Gemini Deep Think and IMO Gold

    The IMO Gold story: four co-captains (Yi in Singapore, Jonathan in London, Jordan in Mountain View, and Tong leading the overall effort), training the checkpoint in ~1 week, live competition in Australia with professors punching in problems as they came out, and the tension of not knowing if they'd hit Gold until the human scores came in (because the Gold threshold is a percentile, not a fixed number)

    Why they threw away AlphaProof: "If one model can't do it, can we get to AGI?" The decision to abandon symbolic systems and bet on end-to-end Gemini with RL was bold and non-consensus

    On-policy vs. off-policy RL: off-policy is imitation learning (copying someone else's trajectory), on-policy is the model generating its own outputs, getting rewarded, and training on its own experience—"humans learn by making mistakes, not by copying" (a toy numerical sketch of the distinction follows this list)

    Why self-consistency and parallel thinking are fundamental: sampling multiple times, majority voting, LM judges, and internal verification are all forms of self-consistency that unlock reasoning beyond single-shot inference (a minimal majority-voting sketch also follows this list)

    The data efficiency frontier: humans learn from 8 orders of magnitude less data than models, so where's the bug? Is it the architecture, the learning algorithm, backprop, off-policyness, or something else?

    Three schools of thought on world models: (1) Genie/spatial intelligence (video-based world models), (2) Yann LeCun's JEPA + FAIR's code world models (modeling internal execution state), (3) the amorphous "resolution of possible worlds" paradigm (curve-fitting to find the world model that best explains the data)

    Why AI coding crossed the threshold: Yi now runs a job, gets a bug, pastes it into Gemini, and relaunches without even reading the fix—"the model is better than me at this"

    The Pokémon benchmark: can models complete the Pokédex by searching the web, synthesizing guides, and applying knowledge in a visual game state? "Efficient search of novel idea space is interesting, but we're not even at the point where models can consistently apply knowledge they look up"

    DSI and generative retrieval: re-imagining search as predicting document identifiers with semantic tokens, now deployed at YouTube (symmetric IDs for RecSys) and Spotify

    Why RecSys and IR feel like a different universe: "modeling dynamics are strange, like gravity is different—you hit the shuttlecock and hear glass shatter, cause and effect are too far apart"

    The closed lab advantage is increasing: the gap between frontier labs and open source is growing because ideas compound over time, and researchers keep finding new tricks that play well with everything built before

    Why ideas still matter: "the last five years weren't just blind scaling—transformers, pre-training, RL, self-consistency, all had to play well together to get us here"

    Gemini Singapore: hiring for RL and reasoning researchers, looking for track record in RL or exceptional achievement in coding competitions, and building a small, talent-dense team close to the frontier
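
    To make the on-policy vs. off-policy distinction above concrete, here is a toy numerical sketch (nothing to do with Gemini's actual training stack): a softmax policy over four discrete "answers", updated once by imitating a fixed reference answer (off-policy style) and once by sampling its own answers and reinforcing whatever earned reward (on-policy REINFORCE). The reward function and all numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(4)                       # toy policy over 4 discrete "answers"

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reward(action):
    return 1.0 if action == 2 else 0.0     # made-up task: answer 2 is "correct"

def imitation_update(logits, ref_action, lr=0.5):
    """Off-policy / imitation-style step: raise the log-prob of a reference
    answer produced by someone else, regardless of reward."""
    probs = softmax(logits)
    grad = -probs
    grad[ref_action] += 1.0                # gradient of log p(ref_action)
    return logits + lr * grad

def reinforce_update(logits, lr=0.5):
    """On-policy step (REINFORCE): sample from the current policy and
    reinforce the sampled answer in proportion to the reward it earned."""
    probs = softmax(logits)
    action = rng.choice(len(logits), p=probs)
    grad = -probs
    grad[action] += 1.0
    return logits + lr * reward(action) * grad

logits_off = imitation_update(logits.copy(), ref_action=0)
logits_on = logits.copy()
for _ in range(200):                       # on-policy learning needs its own rollouts
    logits_on = reinforce_update(logits_on)

print("after imitation:", softmax(logits_off).round(2))  # mass shifts toward answer 0
print("after REINFORCE:", softmax(logits_on).round(2))   # mass shifts toward answer 2
```

    The second loop only ever trains on outputs the policy itself generated and the rewards they earned, which is the "learning from your own mistakes" framing from the episode.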
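
    Likewise, the simplest form of self-consistency mentioned above is just sampling several answers and majority-voting. The sketch below stubs out the model with an invented noisy answer distribution, purely to show the mechanism.

```python
import random
from collections import Counter

random.seed(0)

def sample_answer(question: str) -> str:
    """Stand-in for sampling an LLM at nonzero temperature; the answer
    distribution here is invented, with the correct answer as the mode."""
    return random.choices(["42", "41", "40"], weights=[0.5, 0.3, 0.2])[0]

def self_consistent_answer(question: str, n_samples: int = 16) -> str:
    """Sample n times and take a majority vote over final answers."""
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

print(self_consistent_answer("What is 6 * 7?"))  # "42" far more reliably than a single sample
```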



    Yi Tay

    Google DeepMind: https://deepmind.google

    X: https://x.com/YiTayML

    Chapters

    00:00:00 Introduction: Returning to Google DeepMind and the Singapore AGI Team
    00:04:52 The Philosophy of On-Policy RL: Learning from Your Own Mistakes
    00:12:00 IMO Gold Medal: The Journey from AlphaProof to End-to-End Gemini
    00:21:33 Training IMO Cat: Four Captains Across Three Time Zones
    00:26:19 Pokemon and Long-Horizon Reasoning: Beyond Academic Benchmarks
    00:32:59 Reasoning, Chain of Thought, and Latent Thinking
    00:36:29 AI Coding Assistants: From Lazy to Actually Useful
    00:44:46 Is Attention All You Need? Architecture, Learning, and the Local Minima
    00:55:04 Data Efficiency and World Models: The Next Frontier
    01:08:12 DSI and Generative Retrieval: Reimagining Search with Semantic IDs
    01:17:59 Building GDM Singapore: Geography, Talent, and the Symposium
    01:24:18 Hiring Philosophy: High Stats, Research Taste, and Student Budgets
    01:28:49 Health, HRV, and Research Performance: The 23kg Journey
  • Latent Space: The AI Engineer Podcast

    Brex’s AI Hail Mary — With CTO James Reggio

    17/01/2026 | 1h 13 mins.
    From building internal AI labs to becoming CTO of Brex, James Reggio has helped lead one of the most disciplined AI transformations inside a real financial institution where compliance, auditability, and customer trust actually matter.

    We sat down with Reggio to unpack Brex’s three-pillar AI strategy (corporate, operational, and product AI) [https://www.brex.com/journal/brex-ai-native-operations], how SOP-driven agents beat overengineered RL in ops, why Brex lets employees “build their own AI stack” instead of picking winners [https://www.conductorone.com/customers/brex/], and how a small, founder-heavy AI team is shipping production agents to 40,000+ companies. Reggio also goes deep on Brex’s multi-agent “network” architecture, evals for multi-turn systems, agentic coding’s second-order effects on codebase understanding, and why the future of finance software looks less like dashboards and more like executive assistants coordinating specialist agents behind the scenes.

    We discuss:

    Brex’s three-pillar AI strategy: corporate AI for 10x employee workflows, operational AI for cost and compliance leverage, and product AI that lets customers justify Brex as part of their AI strategy to the board

    Why SOP-driven agents beat overengineered RL in finance ops, and how breaking work into auditable, repeatable steps unlocked faster automation in KYC, underwriting, fraud, and disputes

    Building an internal AI platform early: LLM gateways, prompt/version management, evals, cost observability, and why platform work quietly became the force multiplier behind everything else

    Multi-agent “networks” vs single-agent tools: why Brex’s EA-style assistant coordinates specialist agents (policy, travel, reimbursements) through multi-turn conversations instead of one-shot tool calls

    The audit agent pattern: separating detection, judgment, and follow-up into different agents to reduce false negatives without overwhelming finance teams (an illustrative sketch of this split follows the list)

    Centralized AI teams without resentment: how Brex avoided “AI envy” by tying work to business impact and letting anyone transfer in if they cared deeply enough

    Letting employees build their own AI stack: ChatGPT vs Claude vs Gemini, Cursor vs Windsurf, and why Brex refuses to pick winners in fast-moving tool races

    Measuring adoption without vanity metrics: why “% of code written by AI” is the wrong KPI and what second-order effects (slop, drift, code ownership) actually matter

    Evals in the real world: regression tests from ops QA, LLM-as-judge for multi-turn agents, and why integration-style evals break faster than you expect

    Teaching AI fluency at scale: the user → advocate → builder → native framework, ops-led training, spot bonuses, and avoiding fear-based adoption

    Re-interviewing the entire engineering org: using agentic coding interviews internally to force hands-on skill upgrades without formal performance scoring

    Headcount in the age of agents: why Brex grew the business without growing engineering, and why AI amplifies bad architecture as fast as good decisions

    The future of finance software: why dashboards fade, assistants take over, and agent-to-agent collaboration becomes the real UI
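
    As a rough illustration of the audit-agent split described above (a sketch of the pattern, not Brex's implementation), the pipeline can be wired as three narrowly scoped stages: a cheap detection pass that is allowed to over-flag, a judgment pass that filters those flags, and a follow-up pass that only drafts outreach for confirmed issues. The call_llm stub, the policy prompt, and the expense data are all placeholders.

```python
from dataclasses import dataclass

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call; returns canned text so the sketch runs."""
    return "NEEDS_REVIEW: no receipt attached" if "receipt=False" in prompt else "OK"

@dataclass
class Expense:
    employee: str
    amount: float
    memo: str
    has_receipt: bool

def detection_agent(expenses):
    # Recall-oriented pass: flag anything that might violate policy.
    return [e for e in expenses if not e.has_receipt or e.amount > 500]

def judgment_agent(flagged):
    # A separate agent re-examines each flag, so detection can stay permissive
    # without burying the finance team in false positives.
    confirmed = []
    for e in flagged:
        verdict = call_llm(f"Policy check: {e.memo}, ${e.amount}, receipt={e.has_receipt}")
        if verdict.startswith("NEEDS_REVIEW"):
            confirmed.append((e, verdict))
    return confirmed

def follow_up_agent(confirmed):
    # Only confirmed issues generate messages to employees.
    return [f"Hi {e.employee}, about '{e.memo}': {verdict}" for e, verdict in confirmed]

expenses = [
    Expense("ana", 620.0, "team dinner", has_receipt=False),
    Expense("bo", 40.0, "airport taxi", has_receipt=True),
]
print(follow_up_agent(judgment_agent(detection_agent(expenses))))
```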



    James Reggio

    X: https://x.com/jamesreggio

    LinkedIn: https://www.linkedin.com/in/jamesreggio/

    Where to find Latent Space

    X: https://x.com/latentspacepod

    Substack: https://www.latent.space/

    Chapters

    00:00:00 Introduction
    00:01:24 From Mobile Engineer to CTO: The Founder's Path
    00:03:00 Quitters Welcome: Building a Founder-Friendly Culture
    00:05:13 The AI Team Structure: 10-Person Startup Within Brex
    00:11:55 Building the Brex Agent Platform: Multi-Agent Networks
    00:13:45 Tech Stack Decisions: TypeScript, Mastra, and MCP
    00:16:40 The Brex Assistant: Executive Assistant for Every Employee
    00:24:32 Operational AI: Automating Underwriting, KYC, and Fraud
    00:37:11 Agentic Coding Adoption: Cursor, Windsurf, and the Engineering Interview
    00:40:26 Evaluation Strategy: From Simple SOPs to Multi-Turn Evals
    00:58:51 AI Fluency Levels: From User to Native
    01:03:33 The Future of Engineering Headcount and AI Leverage
    01:09:14 The Audit Agent Network: Finance Team Agents in Action
  • Latent Space: The AI Engineer Podcast

    Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith

    09/01/2026 | 1h 18 mins.
    don’t miss George’s AIE talk: https://www.youtube.com/watch?v=sRpqPgKeXNk


    From launching a side project in a Sydney basement to becoming the independent gold standard for AI benchmarking—trusted by developers, enterprises, and every major lab to navigate the exploding landscape of models, providers, and capabilities—George Cameron and Micah Hill-Smith have spent two years building Artificial Analysis into the platform that answers the questions no one else will: Which model is actually best for your use case? What are the real speed-cost trade-offs? And how open is "open" really?

    We discuss:

    The origin story: built as a side project in 2023 while Micah was building a legal AI assistant, launched publicly in January 2024, and went viral after Swyx's retweet

    Why they run evals themselves: labs prompt models differently, cherry-pick chain-of-thought examples (Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU), and self-report inflated numbers

    The mystery shopper policy: they register accounts not on their own domain and run intelligence + performance benchmarks incognito to prevent labs from serving different models on private endpoints

    How they make money: enterprise benchmarking insights subscription (standardized reports on model deployment, serverless vs. managed vs. leasing chips) and private custom benchmarking for AI companies (no one pays to be on the public leaderboard)

    The Intelligence Index (V3): synthesizes 10 eval datasets (MMLU, GPQA, agentic benchmarks, long-context reasoning) into a single score, with 95% confidence intervals via repeated runs

    Omissions Index (hallucination rate): scores models from -100 to +100 (penalizing incorrect answers, rewarding "I don't know"), and Claude models lead with the lowest hallucination rates despite not always being the smartest (an assumed-scoring illustration follows this list)

    GDP Val AA: their version of OpenAI's GDPval (44 white-collar tasks with spreadsheets, PDFs, PowerPoints), run through their Stirrup agent harness (up to 100 turns, code execution, web search, file system), graded by Gemini 3 Pro as an LLM judge (tested extensively, no self-preference bias)

    The Openness Index: scores models 0-18 on transparency of pre-training data, post-training data, methodology, training code, and licensing (AI2 OLMo 2 leads, followed by Nous Hermes and NVIDIA Nemotron)

    The smiling curve of AI costs: GPT-4-level intelligence is 100-1000x cheaper than at launch (thanks to smaller models like Amazon Nova), but frontier reasoning models in agentic workflows cost more than ever (sparsity, long context, multi-turn agents)

    Why sparsity might go way lower than 5%: GPT-4.5 is ~5% active, Gemini models might be ~3%, and Omissions Index accuracy correlates with total parameters (not active), suggesting massive sparse models are the future

    Token efficiency vs. turn efficiency: GPT-5 costs more per token but solves Tau-bench in fewer turns (cheaper overall), and models are getting better at using more tokens only when needed (5.1 Codex has tighter token distributions); a worked-numbers example follows this list

    V4 of the Intelligence Index coming soon: adding GDP Val AA, Critical Point, hallucination rate, and dropping some saturated benchmarks (HumanEval-style coding is now trivial for small models)
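
    The episode does not spell out the exact Omissions Index formula, so the sketch below uses an assumed scoring rule that matches the description above: +1 for a correct answer, -1 for an incorrect one, 0 for "I don't know", averaged and scaled to the -100..+100 range. Treat it as an illustration of why abstaining can beat guessing, not as Artificial Analysis's actual methodology.

```python
def omissions_style_score(answers):
    """Assumed rule for illustration: +1 correct, -1 incorrect, 0 abstain,
    averaged over questions and scaled to -100..+100."""
    points = {"correct": 1, "incorrect": -1, "abstain": 0}
    return 100 * sum(points[a] for a in answers) / len(answers)

# Invented result profiles: a model that always guesses vs. one that abstains when unsure.
guesser = ["correct"] * 60 + ["incorrect"] * 40
abstainer = ["correct"] * 55 + ["abstain"] * 35 + ["incorrect"] * 10
print(omissions_style_score(guesser))    # 20.0
print(omissions_style_score(abstainer))  # 45.0 -- fewer correct answers, but far fewer hallucinations
```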
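
    The per-token vs. per-task pricing point above is easiest to see with made-up numbers (these are not real prices or Tau-bench traces): a model that is twice as expensive per token can still be cheaper per solved task if it finishes in far fewer turns.

```python
def task_cost(price_per_m_tokens, tokens_per_turn, turns):
    """Total cost of one task: tokens generated across all turns times the unit price."""
    return price_per_m_tokens * tokens_per_turn * turns / 1_000_000

# Hypothetical numbers only: model A is pricier per token but needs fewer turns.
cost_a = task_cost(price_per_m_tokens=10.0, tokens_per_turn=2_000, turns=4)
cost_b = task_cost(price_per_m_tokens=5.0, tokens_per_turn=2_000, turns=12)
print(f"model A: ${cost_a:.2f} per task")  # $0.08
print(f"model B: ${cost_b:.2f} per task")  # $0.12 -- cheaper per token, pricier per task
```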



    Artificial Analysis

    Website: https://artificialanalysis.ai

    George Cameron on X: https://x.com/grmcameron

    Micah Hill-Smith on X: https://x.com/_micah_h

    Chapters

    00:00:00 Introduction: Full Circle Moment and Artificial Analysis Origins
    00:01:08 Business Model: Independence and Revenue Streams
    00:04:00 The Origin Story: From Legal AI to Benchmarking
    00:07:00 Early Challenges: Cost, Methodology, and Independence
    00:16:13 AI Grant and Moving to San Francisco
    00:18:58 Evolution of the Intelligence Index: V1 to V3
    00:27:55 New Benchmarks: Hallucination Rate and Omissions Index
    00:33:19 Critical Point and Frontier Physics Problems
    00:35:56 GDPVAL AA: Agentic Evaluation and Stirrup Harness
    00:51:47 The Openness Index: Measuring Model Transparency
    00:57:57 The Smiling Curve: Cost of Intelligence Paradox
    01:04:00 Hardware Efficiency and Sparsity Trends
    01:07:43 Reasoning vs Non-Reasoning: Token Efficiency Matters
    01:10:47 Multimodal Benchmarking and Community Requests
    01:14:50 Looking Ahead: V4 Intelligence Index and Beyond


About Latent Space: The AI Engineer Podcast

The podcast by and for AI Engineers! In 2024, over 2 million readers and listeners came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. We strive to give you both the definitive take on the Current Thing and the first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from OpenAI, Anthropic, Gemini, Meta (Soumith Chintala), Sierra (Bret Taylor), tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space