Powered by RND
PodcastsTechnologyLatent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0
Listen to Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0 in the App
Listen to Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0 in the App
(3,100)(247,963)
Save favourites
Alarm
Sleep timer

Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

Podcast Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0
Alessio + swyx
The podcast by and for AI Engineers! In 2023, over 1 million visitors came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover ...

Available Episodes

5 of 100
  • Windsurf: The Enterprise AI IDE - with Varun and Anshul of Codeium AI
    Our second podcast guest ever in March 2023 was Varun Mohan, CEO of Codeium; at the time, they had around 10,000 users and how they vowed to keep their autocomplete free forever: Today, over a million developers use their products, they still have their free tier, and they recently launched Windsurf, an AI IDE. Chapters* 00:00:00: Introductions & Catchup* 00:03:52: Why they created Windsurf* 00:05:52: Limitations of VS Code* 00:10:12: Evaluation methods for Cascade and Windsurf* 00:16:15: Listener questions about Windsurf launch* 00:20:30: Remote execution and security concerns* 00:25:18: Evolution of Codeium's strategy* 00:28:29: Cascade and its capabilities* 00:33:12: Multi-agent systems* 00:37:02: Areas of improvement for Windsurf* 00:39:12: Building an enterprise-first company* 00:42:01: Copilot for X, AI UX, and Enterprise AI blog posts Get full access to Latent Space at www.latent.space/subscribe
    --------  
    1:06:35
  • Generative Video WorldSim, Diffusion, Vision, Reinforcement Learning and Robotics — ICML 2024 Part 1
    Regular tickets are now sold out for Latent Space LIVE! at NeurIPS! We have just announced our last speaker and newest track, friend of the pod Nathan Lambert who will be recapping 2024 in Reasoning Models like o1! We opened up a handful of late bird tickets for those who are deciding now — use code DISCORDGANG if you need it. See you in Vancouver!We’ve been sitting on our ICML recordings for a while (from today’s first-ever SOLO guest cohost, Brittany Walker), and in light of Sora Turbo’s launch (blogpost, tutorials) today, we figured it would be a good time to drop part one which had been gearing up to be a deep dive into the state of generative video worldsim, with a seamless transition to vision (the opposite modality), and finally robots (their ultimate application).Sora, Genie, and the field of Generative Video World SimulatorsBill Peebles, author of Diffusion Transformers, gave his most recent Sora talk at ICML, which begins our episode:* William (Bill) Peebles - SORA (slides)Something that is often asked about Sora is how much inductive biases were introduced to achieve these results. Bill references the same principles brought by Hyung Won Chung from the o1 team - “sooner or later those biases come back to bite you”.We also recommend these reads from throughout 2024 on Sora.* Lilian Weng’s literature review of Video Diffusion Models* Sora API leak* Estimates of 100k-700k H100s needed to serve Sora (not Turbo)* Artist guides on using Sora for professional storytellingGoogle DeepMind had a remarkably strong presence at ICML on Video Generation Models, winning TWO Best Paper awards for:* Genie: Generative Interactive Environments (covered in oral, poster, and workshop)* VideoPoet: A Large Language Model for Zero-Shot Video Generation (see website)We end this part by taking in Tali Dekel’s talk on The Future of Video Generation: Beyond Data and Scale.Part 2: Generative Modeling and DiffusionSince 2023, Sander Dieleman’s perspectives (blogpost, tweet) on diffusion as “spectral autoregression in the frequency domain” while working on Imagen and Veo have caught the public imagination, so we highlight his talk:* Wading through the noise: an intuitive look at diffusion modelsThen we go to Ben Poole for his talk on Inferring 3D Structure with 2D Priors, including his work on NeRFs and DreamFusion:Then we investigate two flow matching papers - one from the Flow Matching co-authors - Ricky T. Q. Chen (FAIR, Meta)And how it is implemented in Stable Diffusion 3 with Scaling Rectified Flow Transformers for High-Resolution Image Synthesis Our last hit on Diffusion is a couple of oral presentations on speech, which we leave you to explore via our audio podcast* NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models* Speech Self-Supervised Learning Using Diffusion Model Synthetic DataPart 3: VisionThe ICML Test of Time winner was DeCAF, which Trevor Darrell notably called “the OG vision foundation model”.Lucas Beyer’s talk on “Vision in the age of LLMs — a data-centric perspective” was also well received online, and he talked about his journey from Vision Transformers to PaliGemma.We give special honorable mention to MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark.Part 4: Reinforcement Learning and RoboticsWe segue vision into robotics with the help of Ashley Edwards, whose work on both the Gato and the Genie teams at Deepmind is summarized in Learning actions, policies, rewards, and environments from videos alone.Brittany highlighted two poster session papers:* Behavior Generation with Latent Actions* We also recommend Lerrel Pinto’s On Building General-Purpose Robots* PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMsHowever we must give the lion’s share of space to Chelsea Finn, now founder of Physical Intelligence, who gave FOUR talks on* "What robots have taught me about machine learning"* developing robot generalists* robots that adapt autonomously* how to give feedback to your language model* special mention to PI colleague Sergey Levine on Robotic Foundation ModelsWe end the podcast with a position paper that links generative environments and RL/robotics: Automatic Environment Shaping is the Next Frontier in RL.Timestamps* [00:00:00] Intros* [00:02:43] Sora - Bill Peebles* [00:44:52] Genie: Generative Interactive Environments* [01:00:17] Genie interview* [01:12:33] VideoPoet: A Large Language Model for Zero-Shot Video Generation* [01:30:51] VideoPoet interview - Dan Kondratyuk* [01:42:00] Tali Dekel - The Future of Video Generation: Beyond Data and Scale.* [02:27:07] Sander Dieleman - Wading through the noise: an intuitive look at diffusion models* [03:06:20] Ben Poole - Inferring 3D Structure with 2D Priors* [03:30:30] Ricky Chen - Flow Matching* [04:00:03] Patrick Esser - Stable Diffusion 3* [04:14:30] NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models* [04:27:00] Speech Self-Supervised Learning Using Diffusion Model Synthetic Data* [04:39:00] ICML Test of Time winner: DeCAF* [05:03:40] Lucas Beyer: “Vision in the age of LLMs — a data-centric perspective”* [05:42:00] Ashley Edwards: Learning actions, policies, rewards, and environments from videos alone.* [06:03:30] Behavior Generation with Latent Actions interview* [06:09:52] Chelsea Finn: "What robots have taught me about machine learning"* [06:56:00] Position: Automatic Environment Shaping is the Next Frontier in RL Get full access to Latent Space at www.latent.space/subscribe
    --------  
    7:07:47
  • Bolt.new, Flow Engineering for Code Agents, and >$8m ARR in 2 months as a Claude Wrapper
    The full schedule for Latent Space LIVE! at NeurIPS has been announced, featuring Best of 2024 overview talks for the AI Startup Landscape, Computer Vision, Open Models, Transformers Killers, Synthetic Data, Agents, and Scaling, and speakers from Sarah Guo of Conviction, Roboflow, AI2/Meta, Recursal/Together, HuggingFace, OpenHands and SemiAnalysis. Join us for the IRL event/Livestream! Alessio will also be holding a meetup at AWS Re:Invent in Las Vegas this Wednesday. See our new Events page for dates of AI Engineer Summit, Singapore, and World’s Fair in 2025. LAST CALL for questions for our big 2024 recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!When we first observed that GPT Wrappers are Good, Actually, we did not even have Bolt on our radar. Since we recorded our Anthropic episode discussing building Agents with the new Claude 3.5 Sonnet, Bolt.new (by Stackblitz) has easily cleared the $8m ARR bar, repeating and accelerating its initial $4m feat.There are very many AI code generators and VS Code forks out there, but Bolt probably broke through initially because of its incredible zero shot low effort app generation:But as we explain in the pod, Bolt also emphasized deploy (Netlify)/ backend (Supabase)/ fullstack capabilities on top of Stackblitz’s existing WebContainer full-WASM-powered-developer-environment-in-the-browser tech. Since then, the team has been shipping like mad (with weekly office hours), with bugfixing, full screen, multi-device, long context, diff based edits (using speculative decoding like we covered in Inference, Fast and Slow).All of this has captured the imagination of low/no code builders like Greg Isenberg and many others on YouTube/TikTok/Reddit/X/Linkedin etc:Just as with Fireworks, our relationship with Bolt/Stackblitz goes a bit deeper than normal - swyx advised the launch and got a front row seat to this epic journey, as well as demoed it with Realtime Voice at the recent OpenAI Dev Day. So we are very proud to be the first/closest to tell the full open story of Bolt/Stackblitz!Flow Engineering + Qodo/AlphaCodium UpdateIn year 2 of the pod we have been on a roll getting former guests to return as guest cohosts (Harrison Chase, Aman Sanger, Jon Frankle), and it was a pleasure to catch Itamar Friedman back on the pod, giving us an update on all things Qodo and Testing Agents from our last catchup a year and a half ago:Qodo (they renamed in September) went viral in early January this year with AlphaCodium (paper here, code here) beating DeepMind’s AlphaCode with high efficiency:With a simple problem solving code agent:* The first step is to have the model reason about the problem. They describe it using bullet points and focus on the goal, inputs, outputs, rules, constraints, and any other relevant details.* Then, they make the model reason about the public tests and come up with an explanation of why the input leads to that particular output. * The model generates two to three potential solutions in text and ranks them in terms of correctness, simplicity, and robustness. * Then, it generates more diverse tests for the problem, covering cases not part of the original public tests. * Iteratively, pick a solution, generate the code, and run it on a few test cases. * If the tests fail, improve the code and repeat the process until the code passes every test.swyx has previously written similar thoughts on types vs tests for putting bounds on program behavior, but AlphaCodium extends this to AI generated tests and code.More recently, Itamar has also shown that AlphaCodium’s techniques also extend well to the o1 models:Making Flow Engineering a useful technique to improve code model performance on every model. This is something we see AI Engineers uniquely well positioned to do compared to ML Engineers/Researchers.Full Video PodcastLike and subscribe!Show Notes* Itamar* Qodo* First episode* Eric* Bolt* StackBlitz* Thinkster* AlphaCodium* WebContainersChapters* 00:00:00 Introductions & Updates* 00:06:01 Generic vs. Specific AI Agents* 00:07:40 Maintaining vs Creating with AI* 00:17:46 Human vs Agent Computer Interfaces* 00:20:15 Why Docker doesn't work for Bolt* 00:24:23 Creating Testing and Code Review Loops* 00:28:07 Bolt's Task Breakdown Flow* 00:31:04 AI in Complex Enterprise Environments* 00:41:43 AlphaCodium* 00:44:39 Strategies for Breaking Down Complex Tasks* 00:45:22 Building in Open Source* 00:50:35 Choosing a product as a founder* 00:59:03 Reflections on Bolt Success* 01:06:07 Building a B2C GTM* 01:18:11 AI Capabilities and Pricing Tiers* 01:20:28 What makes Bolt unique* 01:23:07 Future Growth and Product Development* 01:29:06 Competitive Landscape in AI Engineering* 01:30:01 Advice to Founders and Embracing AI* 01:32:20 Having a baby and completing an Iron ManTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.Swyx [00:00:12]: Hey, and today we're still in our sort of makeshift in-between studio, but we're very delighted to have a former returning guest host, Itamar. Welcome back.Itamar [00:00:21]: Great to be here after a year or more. Yeah, a year and a half.Swyx [00:00:24]: You're one of our earliest guests on Agents. Now you're CEO co-founder of Kodo. Right. Which has just been renamed. You also raised a $40 million Series A, and we can get caught up on everything, but we're also delighted to have our new guest, Eric. Welcome.Eric [00:00:42]: Thank you. Excited to be here. Should I say Bolt or StackBlitz?Swyx [00:00:45]: Like, is it like its own company now or?Eric [00:00:47]: Yeah. Bolt's definitely bolt.new. That's the thing that we're probably the most known for, I imagine, at this point.Swyx [00:00:54]: Which is ridiculous to say because you were working at StackBlitz for so long.Eric [00:00:57]: Yeah. I mean, within a week, we were doing like double the amount of traffic. And StackBlitz had been online for seven years, and we were like, what? But anyways, yeah. So we're StackBlitz, the company behind bolt.new. If you've heard of bolt.new, that's our stuff. Yeah.Swyx [00:01:12]: Yeah.Itamar [00:01:13]: Excellent. I see, by the way, that the founder mode, you need to know to capture opportunities. So kudos on doing that, right? You're working on some technology, and then suddenly you can exploit that to a new world. Yeah.Eric [00:01:24]: Totally. And I think, well, not to jump, but 100%, I mean, a couple of months ago, we had the idea for Bolt earlier this year, but we haven't really shared this too much publicly. But we actually had tried to build it with some of those state-of-the-art models back in January, February, you can kind of imagine which, and they just weren't good enough to actually do the code generation where the code was accurate and it was fast and whatever have you without a ton of like rag, but then there was like issues with that. So we put it on the shelf and then we got kind of a sneak peek of some of the new models that have come out in the past couple of months now. And so once we saw that, once we actually saw the code gen from it, we were like, oh my God, like, okay, we can build a product around this. And so that was really the impetus of us building the thing. But with that, it was StackBlitz, the core StackBlitz product the past seven years has been an IDE for developers. So the entire user experience flow we've built up just didn't make sense. And so when we kind of went out to build Bolt, we just thought, you know, if we were inventing our product today, what would the interface look like given what is now possible with the AI code gen? And so there's definitely a lot of conversations we had internally, but you know, just kind of when we logically laid it out, we were like, yeah, I think it makes sense to just greenfield a new thing and let's see what happens. If it works great, then we'll figure it out. If it doesn't work great, then it'll get deleted at some point. So that's kind of how it actually came to be.Swyx [00:02:49]: I'll mention your background a little bit. You were also founder of Thinkster before you started StackBlitz. So both of you are second time founders. Both of you have sort of re-founded your company recently. Yours was more of a rename. I think a slightly different direction as well. And then we can talk about both. Maybe just chronologically, should we get caught up on where Kodo is first and then you know, just like what people should know since the last pod? Sure.Itamar [00:03:12]: The last pod was two months after we launched and we basically had the vision that we talked about. The idea that software development is about specification, test and code, etc. We are more on the testing part as in essence, we think that if you solve testing, you solve software development. The beautiful chart that we'll put up on screen. And testing is a really big field, like there are many dimensions, unit testing, the level of the component, how big it is, how large it is. And then there is like different type of testing, is it regression or smoke or whatever. So back then we only had like one ID extension with unit tests as in focus. One and a half year later, first ID extension supports more type of testing as context aware. We index local, local repos, but also 10,000s of repos for Fortune 500 companies. We have another agent, another tool that is called, the pure agent is the open source and the commercial one is CodoMerge. And then we have another open source called CoverAgent, which is not yet a commercial product coming very soon. It's very impressive. It could be that already people are approving automated pull requests that they don't even aware in really big open sources. So once we have enough of these, we will also launch another agent. So for the first one and a half year, what we did is grew in our offering and mostly on the side of, does this code actually works, testing, code review, et cetera. And we believe that's the critical milestone that needs to be achieved to actually have the AI engineer for enterprise software. And then like for the first year was everything bottom up, getting to 1 million installation. 2024, that was 2023, 2024 was starting to monetize, to feel like how it is to make the first buck. So we did the teams offering, it went well with a thousand of teams, et cetera. And then we started like just a few months ago to do enterprise with everything you need, which is a lot of things that discussed in the last post that was just released by Codelm. So that's how we call it at Codelm. Just opening the brackets, our company name was Codelm AI, and we renamed to Codo and we call our models Codelm. So back to my point, so we started Enterprise Motion and already have multiple Fortune 100 companies. And then with that, we raised a series of $40 million. And what's exciting about it is that enables us to develop more agents. That's our focus. I think it's very different. We're not coming very soon with an ID or something like that.Swyx [00:06:01]: You don't want to fork this code?Itamar [00:06:03]: Maybe we'll fork JetBrains or something just to be different.Swyx [00:06:08]: I noticed that, you know, I think the promise of general purpose agents has kind of died. Like everyone is doing kind of what you're doing. There's Codogen, Codomerge, and then there's a third one. What's the name of it?Itamar [00:06:17]: Yeah. Codocover. Cover. Which is like a commercial version of a cover agent. It's coming soon.Swyx [00:06:23]: Yeah. It's very similar with factory AI, also doing like droids. They all have special purpose doing things, but people don't really want general purpose agents. Right. The last time you were here, we talked about AutoGBT, the biggest thing of 2023. This year, not really relevant anymore. And I think it's mostly just because when you give me a general purpose agent, I don't know what to do with it.Eric [00:06:42]: Yeah.Itamar [00:06:43]: I totally agree with that. We're seeing it for a while and I think it will stay like that despite the computer use, et cetera, that supposedly can just replace us. You can just like prompt it to be, hey, now be a QA or be a QA person or a developer. I still think that there's a few reasons why you see like a dedicated agent. Again, I'm a bit more focused, like my head is more on complex software for big teams and enterprise, et cetera. And even think about permissions and what are the data sources and just the same way you manage permissions for users. Developers, you probably want to have dedicated guardrails and dedicated approvals for agents. I intentionally like touched a point on not many people think about. And of course, then what you can think of, like maybe there's different tools, tool use, et cetera. But just the first point by itself is a good reason why you want to have different agents.Alessio [00:07:40]: Just to compare that with Bot.new, you're almost focused on like the application is very complex and now you need better tools to kind of manage it and build on top of it. On Bot.new, it's almost like I was using it the other day. There's basically like, hey, look, I'm just trying to get started. You know, I'm not very opinionated on like how you're going to implement this. Like this is what I want to do. And you build a beautiful app with it. What people ask as the next step, you know, going back to like the general versus like specific, have you had people say, hey, you know, this is great to start, but then I want a specific Bot.new dot whatever else to do a more vertical integration and kind of like development or what's the, what do people say?Eric [00:08:18]: Yeah. I think, I think you kind of hit the, hit it head on, which is, you know, kind of the way that we've, we've kind of talked about internally is it's like people are using Bolt to go from like 0.0 to 1.0, like that's like kind of the biggest unlock that Bolt has versus most other things out there. I mean, I think that's kind of what's, what's very unique about Bolt. I think the, you know, the working on like existing enterprise applications is, I mean, it's crazy important because, you know, there's a, you look, when you look at the fortune 500, I mean, these code bases, some of these have been around for 20, 30 plus years. And so it's important to be going from, you know, 101.3 to 101.4, et cetera. I think for us, so what's been actually pretty interesting is we see there's kind of two different users for us that are coming in and it's very distinct. It's like people that are developers already. And then there's people that have never really written software and more if they have, it's been very, very minimal. And so in the first camp, what these developers are doing, like to go from zero to one, they're coming to Bolt and then they're ejecting the thing to get up or just downloading it and, you know, opening cursor, like whatever to, to, you know, keep iterating on the thing. And sometimes they'll bring it back to Bolt to like add in a huge piece of functionality or something. Right. But for the people that don't know how to code, they're actually just, they, they live in this thing. And that was one of the weird things when we launched is, you know, within a day of us being online, one of the most popular YouTube videos, and there's been a ton since, which was, you know, there's like, oh, Bolt is the cursor killer. And I originally saw the headlines and I was like, thanks for the views. I mean, I don't know. This doesn't make sense to me. That's not, that's not what we kind of thought.Swyx [00:09:44]: It's how YouTubers talk to each other. Well, everything kills everything else.Eric [00:09:47]: Totally. But what blew my mind was that there was any comparison because it's like cursor is a, is a local IDE product. But when, when we actually kind of dug into it and we, and we have people that are using our product saying this, I'm not using cursor. And I was like, what? And it turns out there are hundreds of thousands of people that we have seen that we're using cursor and we're trying to build apps with that where they're not traditional software does, but we're heavily leaning on the AI. And as you can imagine, it is very complicated, right? To do that with cursor. So when Bolt came out, they're like, wow, this thing's amazing because it kind of inverts the complexity where it's like, you know, it's not an IDE, it's, it's a, it's a chat-based sort of interface that we have. So that's kind of the split, which is rather interesting. We've had like the first startups now launch off of Bolt entirely where this, you know, tomorrow I'm doing a live stream with this guy named Paul, who he's built an entire CRM using this thing and you know, with backend, et cetera. And people have made their first money on the internet period, you know, launching this with Stripe or whatever have you. So that's, that's kind of the two main, the two main categories of folks that we see using Bolt though.Itamar [00:10:51]: I agree that I don't understand the comparison. It doesn't make sense to me. I think like we have like two type of families of tools. One is like we re-imagine the software development. I think Bolt is there and I think like a cursor is more like a evolution of what we already have. It's like taking the IDE and it's, it's amazing and it's okay, let's, let's adapt the IDE to an era where LLMs can do a lot for us. And Bolt is more like, okay, let's rethink everything totally. And I think we see a few tools there, like maybe Vercel, Veo and maybe Repl.it in that area. And then in the area of let's expedite, let's change, let's, let's progress with what we already have. You can see Cursor and Kodo, but we're different between ourselves, Cursor and Kodo, but definitely I think that comparison doesn't make sense.Alessio [00:11:42]: And just to set the context, this is not a Twitter demo. You've made 4 million of revenue in four weeks. So this is, this is actually working, you know, it's not a, what, what do you think that is? Like, there's been so many people demoing coding agents on Twitter and then it doesn't really work. And then you guys were just like, here you go, it's live, go use it, pay us for it. You know, is there anything in the development that was like interesting and maybe how that compares to building your own agents?Eric [00:12:08]: We had no idea, honestly, like we, we, we've been pretty blown away and, and things have just kind of continued to grow faster since then. We're like, oh, today is week six. So I, I kind of came back to the point you just made, right, where it's, you, you kind of outlined, it's like, there's kind of this new market of like kind of rethinking the software development and then there's heavily augmenting existing developers. I think that, you know, both of which are, you know, AI code gen being extremely good, it's allowed existing developers, it's allowing existing developers to camera out software far faster than they could have ever before, right? It's like the ultimate power tool for an existing developer. But this code gen stuff is now so good. And then, and we saw this over the past, you know, from the beginning of the year when we tried to first build, it's actually lowered the barrier to people that, that aren't traditionally software engineers. But the kind of the key thing is if you kind of think about it from, imagine you've never written software before, right? My co-founder and I, he and I grew up down the street from each other in Chicago. We learned how to code when we were 13 together and we've been building stuff ever since. And this is back in like the mid 2000s or whatever, you know, there was nothing for free to learn from online on the internet and how to code. For our 13th birthdays, we asked our parents for, you know, O'Reilly books cause you couldn't get this at the library, right? And so instead of like an Xbox, we got, you know, programming books. But the hardest part for everyone learning to code is getting an environment set up locally, you know? And so when we built StackBlitz, like kind of the key thesis, like seven years ago, the insight we had was that, Hey, it seems like the browser has a lot of new APIs like WebAssembly and service workers, et cetera, where you could actually write an operating system that ran inside the browser that could boot in milliseconds. And you, you know, basically there's this missing capability of the web. Like the web should be able to build apps for the web, right? You should be able to build the web on the web. Every other platform has that, Visual Studio for Windows, Xcode for Mac. The web has no built in primitive for this. And so just like our built in kind of like nerd instinct on this was like, that seems like a huge hole and it's, you know, it will be very valuable or like, you know, very valuable problem to solve. So if you want to set up that environments, you know, this is what we spent the past seven years doing. And the reality is existing developers have running locally. They already know how to set up that environment. So the problem isn't as acute for them. When we put Bolt online, we took that technology called WebContainer and married it with these, you know, state of the art frontier models. And the people that have the most pain with getting stuff set up locally is people that don't code. I think that's been, you know, really the big explosive reason is no one else has been trying to make dev environments work inside of a browser tab, you know, for the past if since ever, other than basically our company, largely because there wasn't an immediate demand or need. So I think we kind of find ourselves at the right place at the right time. And again, for this market of people that don't know how to write software, you would kind of expect that you should be able to do this without downloading something to your computer in the same way that, hey, I don't have to download Photoshop now to make designs because there's Figma. I don't have to download Word because there's, you know, Google Docs. They're kind of looking at this as that sort of thing, right? Which was kind of the, you know, our impetus and kind of vision from the get-go. But you know, the code gen, the AI code gen stuff that's come out has just been, you know, an order of magnitude multiplier on how magic that is, right? So that's kind of my best distillation of like, what is going on here, you know?Alessio [00:15:21]: And you can deploy too, right?Eric [00:15:22]: Yeah.Alessio [00:15:23]: Yeah.Eric [00:15:24]: And so that's, what's really cool is it's, you know, we have deployment built in with Netlify and this is actually, I think, Sean, you actually built this at Netlify when you were there. Yeah. It's one of the most brilliant integrations actually, because, you know, effectively the API that Sean built, maybe you can speak to it, but like as a provider, we can just effectively give files to Netlify without the user even logging in and they have a live website. And if they want to keep, hold onto it, they can click a link and claim it to their Netlify account. But it basically is just this really magic experience because when you come to Bolt, you say, I want a website. Like my mom, 70, 71 years old, made her first website, you know, on the internet two weeks ago, right? It was about her nursing days.Swyx [00:16:03]: Oh, that's fantastic though. It wouldn't have been made.Eric [00:16:06]: A hundred percent. Cause even in, you know, when we've had a lot of people building personal, like deeply personal stuff, like in the first week we launched this, the sales guy from the East Coast, you know, replied to a tweet of mine and he said, thank you so much for building this to your team. His daughter has a medical condition and so for her to travel, she has to like line up donors or something, you know, so ahead of time. And so he actually used Bolt to make a website to do that, to actually go and send it to folks in the region she was going to travel to ahead of time. I was really touched by it, but I also thought like, why, you know, why didn't he use like Wix or Squarespace? Right? I mean, this is, this is a solved problem, quote unquote, right? And then when I thought, I actually use Squarespace for my, for my, uh, the wedding website for my wife and I, like back in 2021, so I'm familiar, you know, it was, it was faster. I know how to code. I was like, this is faster. Right. And I thought back and I was like, there's a whole interface you have to learn how to use. And it's actually not that simple. There's like a million things you can configure in that thing. When you come to Bolt, there's a, there's a text box. You just say, I need a, I need a wedding website. Here's the date. Here's where it is. And here's a photo of me and my wife, put it somewhere relevant. It's actually the simplest way. And that's what my, when my mom came, she said, uh, I'm Pat Simons. I was a nurse in the seventies, you know, and like, here's the things I did and a website came out. So coming back to why is this such a, I think, why are we seeing this sort of growth? It's, this is the simplest interface I think maybe ever created to actually build it, a deploy a website. And then that website, my mom made, she's like, okay, this looks great. And there's, there's one button, you just click it, deploy, and it's live and you can buy a domain name, attach it to it. And you know, it's as simple as it gets, it's getting even simpler with some of the stuff we're working on. But anyways, so that's, it's, it's, uh, it's been really interesting to see some of the usage like that.Swyx [00:17:46]: I can offer my perspective. So I, you know, I probably should have disclosed a little bit that, uh, I'm a, uh, stack list investor.Alessio [00:17:53]: Canceled the episode. I know, I know. Don't play it now. Pause.Eric actually reached out to ShowMeBolt before the launch. And we, you know, we talked a lot about, like, the framing of, of what we're going to talk about how we marketed the thing, but also, like, what we're So that's what Bolt was going to need, like a whole sort of infrastructure.swyx: Netlify, I was a maintainer but I won't take claim for the anonymous upload. That's actually the origin story of Netlify. We can have Matt Billman talk about it, but that was [00:18:00] how Netlify started. You could drag and drop your zip file or folder from your desktop onto a website, it would have a live URL with no sign in.swyx: And so that was the origin story of Netlify. And it just persists to today. And it's just like it's really nice, interesting that both Bolt and CognitionDevIn and a bunch of other sort of agent type startups, they all use Netlify to deploy because of this one feature. They don't really care about the other features.swyx: But, but just because it's easy for computers to use and talk to it, like if you build an interface for computers specifically, that it's easy for them to Navigate, then they will be used in agents. And I think that's a learning that a lot of developer tools companies are having. That's my bolt launch story and now if I say all that stuff.swyx: And I just wanted to come back to, like, the Webcontainers things, right? Like, I think you put a lot of weight on the technical modes. I think you also are just like, very good at product. So you've, you've like, built a better agent than a lot of people, the rest of us, including myself, who have tried to build these things, and we didn't get as far as you did.swyx: Don't shortchange yourself on products. But I think specifically [00:19:00] on, on infra, on like the sandboxing, like this is a thing that people really want. Alessio has Bax E2B, which we'll have on at some point, talking about like the sort of the server full side. But yours is, you know, inside of the browser, serverless.swyx: It doesn't cost you anything to serve one person versus a million people. It doesn't, doesn't cost you anything. I think that's interesting. I think in theory, we should be able to like run tests because you can run the full backend. Like, you can run Git, you can run Node, you can run maybe Python someday.swyx: We talked about this. But ideally, you should be able to have a fully gentic loop, running code, seeing the errors, correcting code, and just kind of self healing, right? Like, I mean, isn't that the dream?Eric: Totally.swyx: Yeah,Eric: totally. At least in bold, we've got, we've got a good amount of that today. I mean, there's a lot more for us to do, but one of the nice things, because like in web container, you know, there's a lot of kind of stuff you go Google like, you know, turn docker container into wasm.Eric: You'll find a lot of stuff out there that will do that. The problem is it's very big, it's slow, and that ruins the experience. And so what we ended up doing is just writing an operating system from [00:20:00] scratch that was just purpose built to, you know, run in a browser tab. And the reason being is, you know, Docker 2 awesome things will give you an image that's like out 60 to 100 megabits, you know, maybe more, you know, and our, our OS, you know, kind of clocks in, I think, I think we're in like a, maybe, maybe a megabyte or less or something like that.Eric: I mean, it's, it's, you know, really, really, you know, stripped down.swyx: This is basically the task involved is I understand that it's. Mapping every single, single Linux call to some kind of web, web assembly implementation,Eric: but more or less, and, and then there's a lot of things actually, like when you're looking at a dev environment, there's a lot of things that you don't need that a traditional OS is gonna have, right?Eric: Like, you know audio drivers or you like, there's just like, there's just tons of things. Oh, yeah. Right. Yeah. That goes . Yeah. You can just kind, you can, you can kind of tos them. Or alternatively, what you can do is you can actually be the nice thing. And this is, this kind of comes back to the origins of browsers, which is, you know, they're, they're at the beginning of the web and, you know, the late nineties, there was two very different kind of visions for the web where Alan Kay vehemently [00:21:00] disagree with the idea that should be document based, which is, you know, Tim Berners Lee, you know, that, and that's kind of what ended up winning, winning was this document based kind of browsing documents on the web thing.Eric: Alan Kay, he's got this like very famous quote where he said, you know, you want web browsers to be mini operating systems. They should download little mini binaries and execute with like a little mini virtualized operating system in there. And what's kind of interesting about the history, not to geek out on this aspect, what's kind of interesting about the history is both of those folks ended up being right.Eric: Documents were actually the pragmatic way that the web worked. Was, you know, became the most ubiquitous platform in the world to the degree now that this is why WebAssembly has been invented is that we're doing, we need to do more low level things in a browser, same thing with WebGPU, et cetera. And so all these APIs, you know, to build an operating system came to the browser.Eric: And that was actually the realization we had in 2017 was, holy heck, like you can actually, you know, service workers, which were designed for allowing your app to work offline. That was the kind of the key one where it was like, wait a second, you can actually now run. Web servers within a [00:22:00] browser, like you can run a server that you open up.Eric: That's wild. Like full Node. js. Full Node. js. Like that capability. Like, I can have a URL that's programmatically controlled. By a web application itself, boom. Like the web can build the web. The primitive is there. Everyone at the time, like we talked to people that like worked on, you know Chrome and V8 and they were like, uhhhh.Eric: You know, like I don't know. But it's one of those things you just kind of have to go do it to find out. So we spent a couple of years, you know, working on it and yeah. And, and, and got to work in back in 2021 is when we kind of put the first like data of web container online. Butswyx: in partnership with Google, right?swyx: Like Google actually had to help you get over the finish line with stuff.Eric: A hundred percent, because well, you know, over the years of when we were doing the R and D on the thing. Kind of the biggest challenge, the two ways that you can kind of test how powerful and capable a platform are, the two types of applications are one, video games, right, because they're just very compute intensive, a lot of calculations that have to happen, right?Eric: The second one are IDEs, because you're talking about actually virtualizing the actual [00:23:00] runtime environment you are in to actually build apps on top of it, which requires sophisticated capabilities, a lot of access to data. You know, a good amount of compute power, right, to effectively, you know, building app in app sort of thing.Eric: So those, those are the stress tests. So if your platform is missing stuff, those are the things where you find out. Those are, those are the people building games and IDEs. They're the ones filing bugs on operating system level stuff. And for us, browser level stuff.Eric [00:23:47]: yeah, what ended up happening is we were just hammering, you know, the Chromium bug tracker, and they're like, who are these guys? Yeah. And, and they were amazing because I mean, just making Chrome DevTools be able to debug, I mean, it's, it's not, it wasn't originally built right for debugging an operating system, right? They've been phenomenal working with us and just kind of really pushing the limits, but that it's a rising tide that's kind of lifted all boats because now there's a lot of different types of applications that you can debug with Chrome Dev Tools that are running a browser that runs more reliably because just the stress testing that, that we and, you know, games that are coming to the web are kind of pushing as well, but.Itamar [00:24:23]: That's awesome. About the testing, I think like most, let's say coding assistant from different kinds will need this loop of testing. And even I would add code review to some, to some extent that you mentioned. How is testing different from code review? Code review could be, for example, PR review, like a code review that is done at the point of when you want to merge branches. But I would say that code review, for example, checks best practices, maintainability, and so on. It's not just like CI, but more than CI. And testing is like a more like checking functionality, et cetera. So it's different. We call, by the way, all of these together code integrity, but that's a different story. Just to go back to the, to the testing and specifically. Yeah. It's, it's, it's since the first slide. Yeah. We're consistent. So if we go back to the testing, I think like, it's not surprising that for us testing is important and for Bolt it's testing important, but I want to shed some light on a different perspective of it. Like let's think about autonomous driving. Those startups that are doing autonomous driving for highway and autonomous driving for the city. And I think like we saw the autonomous of the highway much faster and reaching to a level, I don't know, four or so much faster than those in the city. Now, in both cases, you need testing and quote unquote testing, you know, verifying validation that you're doing the right thing on the road and you're reading and et cetera. But it's probably like so different in the city that it could be like actually different technology. And I claim that we're seeing something similar here. So when you're building the next Wix, and if I was them, I was like looking at you and being a bit scared. That's what you're disrupting, what you just said. Then basically, I would say that, for example, the UX UI is freaking important. And because you're you're more aiming for the end user. In this case, maybe it's an end user that doesn't know how to develop for developers. It's also important. But let alone those that do not know to develop, they need a slick UI UX. And I think like that's one reason, for example, I think Cursor have like really good technology. I don't know the underlying what's under the hood, but at least what they're saying. But I think also their UX UI is great. It's a lot because they did their own ID. While if you're aiming for the city AI, suddenly like there's a lot of testing and code review technology that it's not necessarily like that important. For example, let's talk about integration tests. Probably like a lot of what you're building involved at the moment is isolated applications. Maybe the vision or the end game is maybe like having one solution for everything. It could be that eventually the highway companies will go into the city and the other way around. But at the beginning, there is a difference. And integration tests are a good example. I guess they're a bit less important. And when you think about enterprise software, they're really important. So to recap, like I think like the idea of looping and verifying your test and verifying your code in different ways, testing or code review, et cetera, seems to be important in the highway AI and the city AI, but in different ways and different like critical for the city, even more and more variety. Actually, I was looking to ask you like what kind of loops you guys are doing. For example, when I'm using Bolt and I'm enjoying it a lot, then I do see like sometimes you're trying to catch the errors and fix them. And also, I noticed that you're breaking down tasks into smaller ones and then et cetera, which is already a common notion for a year ago. But it seems like you're doing it really well. So if you're willing to share anything about it.Eric [00:28:07]: Yeah, yeah. I realized I never actually hit the punchline of what I was saying before. I mentioned the point about us kind of writing an operating system from scratch because what ended up being important about that is that to your point, it's actually a very, like compared to like a, you know, if you're like running cursor on anyone's machine, you kind of don't know what you're dealing with, with the OS you're running on. There could be an error happens. It could be like a million different things, right? There could be some config. There could be, it could be God knows what, right? The thing with WebConnect is because we wrote the entire thing from scratch. It's actually a unified image basically. And we can instrument it at any level that we think is going to be useful, which is exactly what we did when we started building Bolt is we instrumented stuff at like the process level, at the runtime level, you know, et cetera, et cetera, et cetera. Stuff that would just be not impossible to do on local, but to do that in a way that works across any operating system, whatever is, I mean, would just be insanely, you know, insanely difficult to do right and reliably. And that's what you saw when you've used Bolt is that when an error actually will occur, whether it's in the build process or the actual web application itself is failing or anything kind of in between, you can actually capture those errors. And today it's a very primitive way of how we've implemented it largely because the product just didn't exist 90 days ago. So we're like, we got some work ahead of us and we got to hire some more a little bit, but basically we present and we say, Hey, this is, here's kind of the things that went wrong. There's a fix it button and then a ignore button, and then you can just hit fix it. And then we take all that telemetry through our agent, you run it through our agent and say, kind of, here's the state of the application. Here's kind of the errors that we got from Node.js or the browser or whatever, and like dah, dah, dah, dah. And it can take a crack at actually solving it. And it's actually pretty darn good at being able to do that. That's kind of been a, you know, closing the loop and having it be a reliable kind of base has seemed to be a pretty big upgrade over doing stuff locally, just because I think that's a pretty key ingredient of it. And yeah, I think breaking things down into smaller tasks, like that's, that's kind of a key part of our agent. I think like Claude did a really good job with artifacts. I think, you know, us and kind of everyone else has, has kind of taken their approach of like actually breaking out certain tasks in a certain order into, you know, kind of a concrete way. And, and so actually the core of Bolt, I know we actually made open source. So you can actually go and check out like the system prompts and et cetera, and you can run it locally and whatever have you. So anyone that's interested in this stuff, I'd highly recommend taking a look at. There's not a lot of like stuff that's like open source in this realm. It's, that was one of the fun things that we've we thought would be cool to do. And people, people seem to like it. I mean, there's a lot of forks and people adding different models and stuff. So it's been cool to see.Swyx [00:30:41]: Yeah. I'm happy to add, I added real-time voice for my opening day demo and it was really fun to hack with. So thank you for doing that. Yeah. Thank you. I'm going to steal your code.Eric [00:30:52]: Because I want that.Swyx [00:30:52]: It's funny because I built on top of the fork of Bolt.new that already has the multi LLM thing. And so you just told me you're going to merge that in. So then you're going to merge two layers of forks down into this thing. So it'll be fun.Eric [00:31:03]: Heck yeah.Alessio [00:31:04]: Just to touch on like the environment, Itamar, you maybe go into the most complicated environments that even the people that work there don't know how to run. How much of an impact does that have on your performance? Like, you know, it's most of the work you're doing actually figuring out environment and like the libraries, because I'm sure they're using outdated version of languages, they're using outdated libraries, they're using forks that have not been on the public internet before. How much of the work that you're doing is like there versus like at the LLM level?Itamar [00:31:32]: One of the reasons I was asking about, you know, what are the steps to break things down, because it really matters. Like, what's the tech stack? How complicated the software is? It's hard to figure it out when you're dealing with the real world, any environment of enterprise as a city, when I'm like, while maybe sometimes like, I think you do enable like in Bolt, like to install stuff, but it's quite a like controlled environment. And that's a good thing to do, because then you narrow down and it's easier to make things work. So definitely, there are two dimensions, I think, actually spaces. One is the fact just like installing our software without yet like doing anything, making it work, just installing it because we work with enterprise and Fortune 500, etc. Many of them want on prem solution.Swyx [00:32:22]: So you have how many deployment options?Itamar [00:32:24]: Basically, we had, we did a metric metrics, say 96 options, because, you know, they're different dimensions. Like, for example, one dimension, we connect to your code management system to your Git. So are you having like GitHub, GitLab? Subversion? Is it like on cloud or deployed on prem? Just an example. Which model agree to use its APIs or ours? Like we have our Is it TestGPT? Yeah, when we started with TestGPT, it was a huge mistake name. It was cool back then, but I don't think it's a good idea to name a model after someone else's model. Anyway, that's my opinion. So we gotSwyx [00:33:02]: I'm interested in these learnings, like things that you change your mind on.Itamar [00:33:06]: Eventually, when you're building a company, you're building a brand and you want to create your own brand. By the way, when I thought about Bolt.new, I also thought about if it's not a problem, because when I think about Bolt, I do think about like a couple of companies that are already called this way.Swyx [00:33:19]: Curse companies. You could call it Codium just to...Itamar [00:33:24]: Okay, thank you. Touche. Touche.Eric [00:33:27]: Yeah, you got to imagine the board meeting before we launched Bolt, one of our investors, you can imagine they're like, are you sure? Because from the investment side, it's kind of a famous, very notorious Bolt. And they're like, are you sure you want to go with that name? Oh, yeah. Yeah, absolutely.Itamar [00:33:43]: At this point, we have actually four models. There is a model for autocomplete. There's a model for the chat. There is a model dedicated for more for code review. And there is a model that is for code embedding. Actually, you might notice that there isn't a good code embedding model out there. Can you name one? Like dedicated for code?Swyx [00:34:04]: There's code indexing, and then you can do sort of like the hide for code. And then you can embed the descriptions of the code.Itamar [00:34:12]: Yeah, but you do see a lot of type of models that are dedicated for embedding and for different spaces, different fields, etc. And I'm not aware. And I know that if you go to the bedrock, try to find like there's a few code embedding models, but none of them are specialized for code.Swyx [00:34:31]: Is there a benchmark that you would tell us to pay attention to?Itamar [00:34:34]: Yeah, so it's coming. Wait for that. Anyway, we have our models. And just to go back to the 96 option of deployment. So I'm closing the brackets for us. So one is like dimensional, like what Git deployment you have, like what models do you agree to use? Dotter could be like if it's air-gapped completely, or you want VPC, and then you have Azure, GCP, and AWS, which is different. Do you use Kubernetes or do not? Because we want to exploit that. There are companies that do not do that, etc. I guess you know what I mean. So that's one thing. And considering that we are dealing with one of all four enterprises, we needed to deal with that. So you asked me about how complicated it is to solve that complex code. I said, it's just a deployment part. And then now to the software, we see a lot of different challenges. For example, some companies, they did actually a good job to build a lot of microservices. Let's not get to if it's good or not, but let's first assume that it is a good thing. A lot of microservices, each one of them has their own repo. And now you have tens of thousands of repos. And you as a developer want to develop something. And I remember me coming to a corporate for the first time. I don't know where to look at, like where to find things. So just doing a good indexing for that is like a challenge. And moreover, the regular indexing, the one that you can find, we wrote a few blogs on that. By the way, we also have some open source, different than yours, but actually three and growing. Then it doesn't work. You need to let the tech leads and the companies influence your indexing. For example, Mark with different repos with different colors. This is a high quality repo. This is a lower quality repo. This is a repo that we want to deprecate. This is a repo we want to grow, etc. And let that be part of your indexing. And only then things actually work for enterprise and they don't get to a fatigue of, oh, this is awesome. Oh, but I'm starting, it's annoying me. I think Copilot is an amazing tool, but I'm quoting others, meaning GitHub Copilot, that they see not so good retention of GitHub Copilot and enterprise. Ooh, spicy. Yeah. I saw snapshots of people and we have customers that are Copilot users as well. And also I saw research, some of them is public by the way, between 38 to 50% retention for users using Copilot and enterprise. So it's not so good. By the way, I don't think it's that bad, but it's not so good. So I think that's a reason because, yeah, it helps you auto-complete, but then, and especially if you're working on your repo alone, but if it's need that context of remote repos that you're code-based, that's hard. So to make things work, there's a lot of work on that, like giving the controllability for the tech leads, for the developer platform or developer experience department in the organization to influence how things are working. A short example, because if you have like really old legacy code, probably some of it is not so good anymore. If you just fine tune on these code base, then there is a bias to repeat those mistakes or old practices, etc. So you need, for example, as I mentioned, to influence that. For example, in Coda, you can have a markdown of best practices by the tech leads and Coda will include that and relate to that and will not offer suggestions that are not according to the best practices, just as an example. So that's just a short list of things that you need to do in order to deal with, like you mentioned, the 100.1 to 100.2 version of software. I just want to say what you're doing is extremelyEric [00:38:32]: impressive because it's very difficult. I mean, the business of Stackplus, kind of before bulk came online, we sold a version of our IDE that went on-prem. So I understand what you're saying about the difficulty of getting stuff just working on-prem. Holy heck. I mean, that is extremely hard. I guess the question I have for you is, I mean, we were just doing that with kind of Kubernetes-based stuff, but the spread of Fortune 500 companies that you're working with, how are they doing the inference for this? Are you kind of plugging into Azure's OpenAI stuff and AWS's Bedrock, you know, Cloud stuff? Or are they just like running stuff on GPUs? Like, what is that? How are these folks approaching that? Because, man, what we saw on the enterprise side, I mean, I got to imagine that that's a huge challenge. Everything you said and more, like,Itamar [00:39:15]: for example, like someone could be, and I don't think any of these is bad. Like, they made their decision. Like, for example, some people, they're, I want only AWS and VPC on AWS, no matter what. And then they, some of them, like there is a subset, I will say, I'm willing to take models only for from Bedrock and not ours. And we have a problem because there is no good code embedding model on Bedrock. And that's part of what we're doing now with AWS to solve that. We solve it in a different way. But if you are willing to run on AWS VPC, but run your run models on GPUs or inferentia, like the new version of the more coming out, then our models can run on that. But everything you said is right. Like, we see like on-prem deployment where they have their own GPUs. We see Azure where you're using OpenAI Azure. We see cases where you're running on GCP and they want OpenAI. Like this cross, like a case, although there is Gemini or even Sonnet, I think is available on GCP, just an example. So all the options, that's part of the challenge. I admit that we thought about it, but it was even more complicated. And it took us a few months to actually, that metrics that I mentioned, to start clicking each one of the blocks there. A few months is impressive. I mean,Eric [00:40:35]: honestly, just that's okay. Every one of these enterprises is, their networking is different. Just everything's different. Every single one is different. I see you understand. Yeah. So that just cannot be understated. That it is, that's extremely impressive. Hats off.Itamar [00:40:50]: It could be, by the way, like, for example, oh, we're only AWS, but our GitHub enterprise is on-prem. Oh, we forgot. So we need like a private link or whatever, like every time like that. It's not, and you do need to think about it if you want to work with an enterprise. And it's important. Like I understand like their, I respect their point of view.Swyx [00:41:10]: And this primarily impacts your architecture, your tech choices. Like you have to, you can't choose some vendors because...Itamar [00:41:15]: Yeah, definitely. To be frank, it makes us hard for a startup because it means that we want, we want everyone to enjoy all the variety of models. By the way, it was hard for us with our technology. I want to open a bracket, like a window. I guess you're familiar with our Alpha Codium, which is an open source.Eric [00:41:33]: We got to go over that. Yeah. So I'll do that quickly.Itamar [00:41:36]: Yeah. A pin in that. Yeah. Actually, we didn't have it in the last episode. So, so, okay.Swyx [00:41:41]: Okay. We'll come back to that later, but let's talk about...Itamar [00:41:43]: Yeah. So, so just like shortly, and then we can double click on Alpha Codium. But Alpha Codium is a open source tool. You can go and try it and lets you compete on CodeForce. This is a website and a competition and actually reach a master level level, like 95% with a click of a button. You don't need to do anything. And part of what we did there is taking a problem and breaking it to different, like smaller blocks. And then the models are doing a much better job. Like we all know it by now that taking small tasks and solving them, by the way, even O1, which is supposed to be able to do system two thinking like Greg from OpenAI like hinted, is doing better on these kinds of problems. But still, it's very useful to break it down for O1, despite O1 being able to think by itself. And that's what we presented like just a month ago, OpenAI released that now they are doing 93 percentile with O1 IOI left and International Olympiad of Formation. Sorry, I forgot. Exactly. I told you I forgot. And we took their O1 preview with Alpha Codium and did better. Like it just shows like, and there is a big difference between the preview and the IOI. It shows like that these models are not still system two thinkers, and there is a big difference. So maybe they're not complete system two. Yeah, they need some guidance. I call them system 1.5. We can, we can have it. I thought about it. Like, you know, I care about this philosophy stuff. And I think like we didn't see it even close to a system two thinking. I can elaborate later. But closing the brackets, like we take Alpha Codium and as our principle of thinking, we take tasks and break them down to smaller tasks. And then we want to exploit the best model to solve them. So I want to enable anyone to enjoy O1 and SONET and Gemini 1.5, etc. But at the same time, I need to develop my own models as well, because some of the Fortune 500 want to have all air gapped or whatever. So that's a challenge. Now you need to support so many models. And to some extent, I would say that the flow engineering, the breaking down to two different blocks is a necessity for us. Why? Because when you take a big block, a big problem, you need a very different prompt for each one of the models to actually work. But when you take a big problem and break it into small tasks, we can talk how we do that, then the prompt matters less. What I want to say, like all this, like as a startup trying to do different deployment, getting all the juice that you can get from models, etc. is a big problem. And one need to think about it. And one of our mitigation is that process of taking tasks and breaking them down. That's why I'm really interested to know how you guys are doing it. And part of what we do is also open source. So you can see.Swyx [00:44:39]: There's a lot in there. But yeah, flow over prompt. I do believe that that does make sense. I feel like there's a lot that both of you can sort of exchange notes on breaking down problems. And I just want you guys to just go for it. This is fun to watch.Eric [00:44:55]: Yeah. I mean, what's super interesting is the context you're working in is, because for us too with Bolt, we've started thinking because our kind of existing business line was going behind the firewall, right? We were like, how do we do this? Adding the inference aspect on, we're like, okay, how does... Because I mean, there's not a lot of prior art, right? I mean, this is all new. This is all new. So I definitely am going to have a lot of questions for you.Itamar [00:45:17]: I'm here. We're very open, by the way. We have a paper on a blog or like whatever.Swyx [00:45:22]: The Alphacodeum, GitHub, and we'll put all this in the show notes.Itamar [00:45:25]: Yeah. And even the new results of O1, we published it.Eric [00:45:29]: I love that. And I also just, I think spiritually, I like your approach of being transparent. Because I think there's a lot of hype-ium around AI stuff. And a lot of it is, it's just like, you have these companies that are just kind of keep their stuff closed source and then just max hype it, but then it's kind of nothing. And I think it kind of gives a bad rep to the incredible stuff that's actually happening here. And so I think it's stuff like what you're doing where, I mean, true merit and you're cracking open actual code for others to learn from and use. That strikes me as the right approach. And it's great to hear that you're making such incredible progress.Itamar [00:46:02]: I have something to share about the open source. Most of our tools are, we have an open source version and then a premium pro version. But it's not an easy decision to do that. I actually wanted to ask you about your strategy, but I think in your case, there is, in my opinion, relatively a good strategy where a lot of parts of open source, but then you have the deployment and the environment, which is not right if I get it correctly. And then there's a clear, almost hugging face model. Yeah, you can do that, but why should you try to deploy it yourself, deploy it with us? But in our case, and I'm not sure you're not going to hit also some competitors, and I guess you are. I wanted to ask you, for example, on some of them. In our case, one day we looked on one of our competitors that is doing code review. We're a platform. We have the code review, the testing, et cetera, spread over the ID to get. And in each agent, we have a few startups or a big incumbents that are doing only that. So we noticed one of our competitors having not only a very similar UI of our open source, but actually even our typo. And you sit there and you're kind of like, yeah, we're not that good. We don't use enough Grammarly or whatever. And we had a couple of these and we saw it there. And then it's a challenge. And I want to ask you, Bald is doing so well, and then you open source it. So I think I know what my answer was. I gave it before, but still interestingEric [00:47:29]: to hear what you think. GeoHot said back, I don't know who he was up to at this exact moment, but I think on comma AI, all that stuff's open source. And someone had asked him, why is this open source? And he's like, if you're not actually confident that you can go and crush it and build the best thing, then yeah, you should probably keep your stuff closed source. He said something akin to that. I'm probably kind of butchering it, but I thought it was kind of a really good point. And that's not to say that you should just open source everything, because for obvious reasons, there's kind of strategic things you have to kind of take in mind. But I actually think a pretty liberal approach, as liberal as you kind of can be, it can really make a lot of sense. Because that is so validating that one of your competitors is taking your stuff and they're like, yeah, let's just kind of tweak the styles. I mean, clearly, right? I think it's kind of healthy because it keeps, I'm sure back at HQ that day when you saw that, you're like, oh, all right, well, we have to grind even harder to make sure we stay ahead. And so I think it's actually a very useful, motivating thing for the teams. Because you might feel this period of comfort. I think a lot of companies will have this period of comfort where they're not feeling the competition and one day they get disrupted. So kind of putting stuff out there and letting people push it forces you to face reality soon, right? And actually feel that incrementally so you can kind of adjust course. And that's for us, the open source version of Bolt has had a lot of features people have been begging us for, like persisting chat messages and checkpoints and stuff. Within the first week, that stuff was landed in the open source versions. And they're like, why can't you ship this? It's in the open, so people have forked it. And we're like, we're trying to keep our servers and GPUs online. But it's been great because the folks in the community did a great job, kept us on our toes. And we've got to know most of these folks too at this point that have been building these things. And so it actually was very instructive. Like, okay, well, if we're going to go kind of land this, there's some UX patterns we can kind of look at and the code is open source to this stuff. What's great about these, what's not. So anyways, NetNet, I think it's awesome. I think from a competitive point of view for us, I think in particular, what's interesting is the core technology of WebContainer going. And I think that right now, there's really nothing that's kind of on par with that. And we also, we have a business of, because WebContainer runs in your browser, but to make it work, you have to install stuff from NPM. You have to make cores bypass requests, like connected databases, which all require server-side proxying or acceleration. And so we actually sell WebContainer as a service. One of the core reasons we open-sourced kind of the core components of Bolt when we launched was that we think that there's going to be a lot more of these AI, in-your-browser AI co-gen experiences, kind of like what Anthropic did with Artifacts and Clod. By the way, Artifacts uses WebContainers. Not yet. No, yeah. Should I strike that? I think that they've got their own thing at the moment, but there's been a lot of interest in WebContainers from folks doing things in that sort of realm and in the AI labs and startups and everything in between. So I think there'll be, I imagine, over the coming months, there'll be lots of things being announced to folks kind of adopting it. But yeah, I think effectively...Swyx [00:50:35]: Okay, I'll say this. If you're a large model lab and you want to build sandbox environments inside of your chat app, you should call Eric.Itamar [00:50:43]: But wait, wait, wait, wait, wait, wait. I have a question about that. I think OpenAI, they felt that people are not using their model as they would want to. So they built ChatGPT. But I would say that ChatGPT now defines OpenAI. I know they're doing a lot of business from their APIs, but still, is this how you think? Isn't Bolt.new your business now? Why don't you focus on that instead of the...Swyx [00:51:16]: What's your advice as a founder?Eric [00:51:18]: You're right. And so going into it, we, candidly, we were like, Bolt.new, this thing is super cool. We think people are stoked. We think people will be stoked. But we were like, maybe that's allowed. Best case scenario, after month one, we'd be mind blown if we added a couple hundred K of error or something. And we were like, but we think there's probably going to be an immediate huge business. Because there was some early poll on folks wanting to put WebContainer into their product offerings, kind of similar to what Bolt is doing or whatever. We were actually prepared for the inverse outcome here. But I mean, well, I guess we've seen poll on both. But I mean, what's happened with Bolt, and you're right, it's actually the same strategy as like OpenAI or Anthropic, where we have our ChatGPT to OpenAI's APIs is Bolt to WebContainer. And so we've kind of taken that same approach. And we're seeing, I guess, some of the similar results, except right now, the revenue side is extremely lopsided to Bolt.Itamar [00:52:16]: I think if you ask me what's my advice, I think you have three options. One is to focus on Bolt. The other is to focus on the WebContainer. The third is to raise one billion dollars and do them both. I'm serious. I think otherwise, you need to choose. And if you raise enough money, and I think it's big bucks, because you're going to be chased by competitors. And I think it will be challenging to do both. And maybe you can. I don't know. We do see these numbers right now, raising above $100 million, even without havingEric [00:52:49]: a product. You can see these. It's excellent advice. And I think what's been amazing, but also kind of challenging is we're trying to forecast, okay, well, where are these things going? I mean, in the initial weeks, I think us and all the investors in the company that we're sharing this with, it was like, this is cool. Okay, we added 500k. Wow, that's crazy. Wow, we're at a million now. Most things, you have this kind of the tech crunch launch of initiation and then the thing of sorrow. And if there's going to be a downtrend, it's just not coming yet. Now that we're kind of looking ahead, we're six weeks in. So now we're getting enough confidence in our convictions to go, okay, this seems to be the trend line. I'll tell you another reason whySwyx [00:53:33]: I think, where is Jasper? They actually just announced some new numbers recently. They're still surviving. They have gone down a lot. I think that the peak that I heard was a hundredItamar [00:53:42]: billion ARR. And now there's like tens of these. So I think their success was phenomenal, like what I see at Bolt. And I think if you want to keep that, probably, who am I? I'm just giving my two cents. You need to focus because you are going to see weeks, I think that you're disrupting their market. And you open sourced some of it and they have containers, I believe. And you need to fight. I can tell you that when we open source, I share with you a small competitor, but I can tell you, I have a friend who has built a billion dollar company and more. When we released Alpha Codium, he sent me a private email asking, what the f**k did you just do? Why did you release that? You should have kept it. Yeah, you released that open source. I'm thinking, build some stuff and now I can do that much more easily. I can tell you my answer and I thought that maybe you'll answer as well. Although I think Bolt is already very promising. For us, Alpha Codium 1 is like GPT 1. I agree with you. Being open and open source, etc. really helps to improve the product community, etc. But at some point, OpenAI closed their GPT 3.5 or whatever. And that was part of my answer. Alpha Codium is the agent that is compatible with GPT 1 and there is a lot to do for these agents to actually get that moment that we had with GPT 3.5, etc. as agents.Eric [00:55:11]: Yeah, I think you're dead right. And I think it just comes back to what GeoHot said. It's like, if you want to win, there's no other option than out hustling everyone else. And so I think that's kind of out hustling in the sense really meaning building the best product, building the best experiences. And so I think that's the only way kind of almost any route and open source and stuff just kind of burns the ships in a sense. And maybe that's the simplest way of saying it. You're burning the ships, but also it builds a lot of goodwill. I mean, there's tons of benefits to it. Salesforce are doing that, right?Itamar [00:55:43]: They're now going to be agent force or whatever. So you can also...Swyx [00:55:47]: We're going to try to get Mark on the podcast. And they're good friends with Salesforce. Any parting thoughts, any trends that you'reItamar [00:55:55]: super excited about? If we're talking about trends, I go back to our original podcast where we talked about the idea that the software world is built from specs, tests, and code. And I think you can see that one dimension are company startups that are rethinking the entire development environment, I think like Bolt, etc. And another dimension is where is their focus? Is it on the spec, is on the test and on the code? And I think it's interesting to see that from that view. We'll see more startup and more amazing announcements of new directions, new philosophy. So I think we'll see startup focusing, let's build everything from the spec. To some extent, I would say that Bolt is, from my understanding, you can say better, somewhere in the line between the spec and the code. Because you start, like I saw your demos, you're trying to describe things, not just in one row, because you want to look like you want it. So it's on that edge between connecting between spec and code. And you see others, I think all the IDEs, most of them are the new IDEs, or the fork are there. We are more focused from the test and to the code and to the spec, etc. So these are trends, I think we will see that. And I think another dimension to consider is, is it more for the highway AI, for the developers, maybe not even a technical person, or is it for the enterprise? And that also gives you different products. If they are aiming for different ICP, different ideal client profile, they will approach this triangle of spec and test and code. And that's how I see the world. And what I'm noticing is that we're seeing more and more of those new startups, new interfaces that are not focused on code. For example, talking more about the spec, talking more about the testing. Eventually, I think that that's where the world is going to. The code is going to be there, and there will be developers, etc. But as agent improves and capabilities of the LLMs and integrations to different parts of the development environment, we're going to see more and more focusing on the spec and the test. Basically, these two might unite, the spec and the test, because you can say that tests are runnable specs, to some extent. So that's another way to look atSwyx [00:58:23]: it. Yeah, that is literally on the slide here, runnable tests, right here. Yeah, I'm consistent.Itamar [00:58:27]: It's all consistent. Look, I talked about system one and system two more than a year ago. And now with O1, people are talking about system one. But I think we'll talk about it again, because I think they're totally, totally wrong about O1 being a system two. It is now in the hype or whatever, talking about that. But I think the agents are the ones that will take us towards system two. And the more they are aware of their environment, and aware of that sometimes they don't know what they don't know, then we'll really get to system two. But that'sSwyx [00:59:03]: a deeper discussion. It's a deeper discussion. I love the philosophy talk that we had last time as well. All right, so we're back on to Bolt, and Itamar had to leave for another interview. But we were just talking about what happened post-launch, right? And I held this emergency council of advisors for you, because we had never seen this before. And I was like, okay, I'm going to call all the smartest people I know to join this thing.Eric [00:59:27]: Which was extremely helpful. And I'm so appreciative. There's been a handful of me.Swyx [00:59:31]: You made one hire out of that.Eric [00:59:34]: Yeah, because it was like, I think I can't remember where we were at kind of ARR-wise when I had messaged you.Swyx [00:59:40]: It was like, you messaged me at like two or three. And then by the time we got everything together, it was four. And then, yeah, now it's at-Alessio [00:59:48]: Since Eric sat down five minutes.Swyx [00:59:52]: But I mean, it sounds like you accelerated, because you told me it was like 100k, 200k a day. And now it's accelerated?Eric [00:59:58]: Yeah, this past- I mean, every week has been kind of a blowout week as far as- Is it TikTok? We're digging into the degree that we can of just like where all this stuff's coming from. I mean, there's a ton of word of mouth, right? So that you can't- which you can't just like look by refer, right? So there's a ton of direct. But yeah, I mean, there's a lot of TikTok. There's a ton of YouTube. It's kind of, I think, been a sensation in the sort of like entrepreneurial, build your own SaaS, indie hacker, even developer circles. And I think, too, our team's been doing a really good job. Our folks just kind of like flipped a switch. And people were just working through the weekends or whatever to get stuff fixed. And so the product- and you'll see people say this online. Like today, there was a tweet. Someone was like, yeah, I tried this like the first week and I couldn't get whatever to work. Came back today, six weeks later, and this is ridiculous. Like this is so good, right? And so I think there's been an incredible amount of improvement to the product, to the agent, also to like the underlying models, too. Like Sonnet, they just happened to do an update with their release a couple of weeks ago. And so when we put our new agent online and the new Sonnet, we saw a huge bump in conversion just based on that. And so yeah, we've gone at that. When we were chatting, that must have been three weeks ago, maybe an average of 100K ARR per day. And this week, I will see- I've said this every week, but we'll see if it holds. The past couple of days have been like half a million of ARR per day, which is insane. I think today we've had peak traffic, just kind of set the previous- and that's kind of been every day this week. But anyways, yeah, I think things just continue to accelerate, which is kind of blowing my mind, because it's just the sheer numbers of this stuff are just mind-boggling.Alessio [01:01:40]: I think you almost suffered from the Twitter demo issues that other people had. The first time I saw Bolt, I saw the demo and I was like, oh, that's cool. I didn't go to try it because I was like, I've seen so many of these that it's like, I don't know if it's actually going to work. And then two days ago, I signed up to use it. I was building a Luma replacement. I'm done with Luma. And I was like, man, this thing really works. And I already knew you, of course. I was like, man, this thing really works. What the f**k? I was like, it's actually, I don't know if it's like the model, if it's like how you prompt it, but it's so good at coming up with the simplest thing to implement. So the Luma example, right? So first I was like, create a RSVP page for an event and it created a wedding RSVP. I don't know if it's your fault. I don't know if you bolted it. And then I was like, well, now it needs to have a way to create more events and added that. And then I was like, now it needs a way to like have an admin page to modify event. And maybe what I would have done as a developer is like, well, I'll create a different like admin view, you know, with all the events and then I'll have like the front end thing. And instead what it did is like, it created like a admin view with toggle on top and then like just a pencil button on every page to edit them in line, you know, and that was it. And I was like, yeah, that works just as well. And like for the model, that's probably the simplest way to do it because it like limits the amount of files that are there. Can you talk just more about how much of this is like the model coming out with it, how much you're prompting it to kind of like be very likeEric [01:03:04]: compressed and concise. A ton of it is the model, but I think what's interesting though, is you're kind of baseline model. If I just like, if it's kind of like try and put it into like a, you know, way, if you had to quantify, quantify, you know, the effect is obviously the model is like this sort of like 10X multiplier. You're how good the bottom line model is huge, huge swing. And then kind of what you can do on top of that, you can squeeze out three, four X kind of more. And so that's kind of where the realm of, you know, prompt engineering and multi-agent approaches, et cetera, kind of kick in. And so I think, I think with us, you know, our folks, like the guy on our side that, you know, led the web engineering, like that kind of our core technology for the past, you know, seven years here, you know, his name is Dominic Elm based out of Germany and he was one of the founding engineers of the company. You had previous to StackBlitz, he actually was doing machine learning and he basically had built a StackBlitz, like online ID for machine learning. So I think like, I kind of like Google Colab sort of thing, or like Hugging Face has their kind of version of this. Back in 2016, it wasn't as much of a market for this stuff, but he had been doing a lot of, you know, training, you know, ML models and that sort of thing. So I guess, you know, as we began, you know, kind of digging into AI stuff over the past year, he's been kind of leading that off. And so a lot of it, I really attribute to Dom's specific angle, cause he has deep understanding of our technology and how it works. Cause he's, you know, led the engineering on web container, but as you know, deep understanding of how these models work going and actually kind of writing out these you know, whether it's like the, the, the prompt engineering aspect of it or multi-agent or whatever, have you, you know, that's sort of like that much context. And, and the, and the other folks on the team are, are, you know, in the same, same sort of spot that have been working on this stuff. I think we'd be able to squeeze out a lot more than I've seen almost anything else out there, at least in the term of building web apps, at least. But I guess I think it's, I think it's kind of just because we we have more context on, on a fewer number of heads at the company. So we can kind of connect the dots of it faster, youSwyx [01:05:01]: know? Yeah. That's part of the issue with the whole raise a billion dollars thing. Like you actually run very lean and that's, that's actually been to your advantage.Eric [01:05:08]: Totally. And I think, you know, and I think we, we have to staff up because I mean, we went from, you know, call it zero customers to, you know, 20, 30,000 kind of, you know, in six weeks, we have to have certainly more customer support, customer success stuff, et cetera. But you know, also just on, on engineering we have to ramp up, but I do think that there's a, we saw this in the 2021 cycle, right? Where, you know, adding tons more people can, can, can be a thing that really hurts, you know, the company because you can, it's just harder. It's really hard to manage lots of people. Not if you're a big enough company to warrant a certain headcount, a 100%, you kind of have to do it. Right. But I think for us, it's worked just to really grow, grow the team slowly and intentionally. And so I think we're going to take the same approach here at a bit of a faster clip than we were previously. But to me, that would just be general advice to startups is like slowly intentionally as fast as you can to meet demand or whatever. Part of what I felt like you're in a unique position toSwyx [01:06:07]: talk about, but also kind of what we went through in our, in our call was I have PMF now, what is, is kind of what I've been saying. And so like, I think the first answer is hire a data scientist because we have to sort of figure out like from our data that you're now sitting on a ton of different customers and we don't really know the different customer segments. You're starting to get an idea of churn. You're starting to get an idea of like segmentation. You already had data enrichment. One of my most interesting quotes from you from that session was that because you were selling to enterprise for so long, you had already set up all that stuff and it's just like, wasn't useful for a more sort of developer bottom up centric approach.Eric [01:06:46]: Yeah. And particularly because for the first time in the company's history, we're selling primarily to almost non-developers. And so everything that we've ever, all the playbooks we had not relevant here basically. Right. So the, and you're one of one of our investors I talked with earlier this week, basically brought up a really great point, which is like, you are now a B2C company and how you operate needs to reflect that.Swyx [01:07:09]: Which is, which is what, I don't know.Eric [01:07:11]: Which is basically from an analytics perspective, like you're tracking everything. Right. And then to your point, you have, you have people kind of around the clock slicing and dicing data to understand who are these people coming in, who are the types of people you actually want to retain versus people that, you know, are just going to churn out. And that's okay. Cause they're not the actual like ICP that you're going for. Right. When you're building stuff for enterprise software, the bar is a lot lower. And then to kind of to, from the conversation before one of the biggest, and this is kind of what we found with StackBlitz, which is kind of interesting, you know, you mentioned it, it's like, it's as a startup, it's very hard to sell on-prem extremely true. But if you can do it, it's like the promised land because you know, these, these companies you know, the fortune 500s, they can write really large checks. And so when you're going and selling to them, it doesn't matter so much like on your website. Sure. You want to track the conversion to the enterprise contact form or whatever. Right. But what, what actually really matters is like the, a lot of human touch points of, Hey, we want to have a quarterly call after just getting installed this stuff. There's a whole playbook for that. And you need to hire sales engineers that can be on the ground floor and helping people install it. Then after that, you got to, okay, how do we make sure they're kind of constantly successful? Because you can't access like we can, our enterprise customer instances, we have no idea how often they're using them. Why? Because the whole point is that we can't see what they're up to for a good reason, right? Like they, they need to own their data. And so the way it's actually much, a very complicated problem of how do you have like build relationships where everyone's getting on calls, they can share kind of the telemetry that, that they can see within their instance. And you can kind of extrapolate that and make sure they're happy and successful. So that's, there's a whole art of that, of doing enterprise well, that we've gone and done and closed these folks totally unrelated to doing BC completely, completely unrelated for the most part. So anyway, so that, so that, you know, we're, as a company, we're, we're kind of reorienting, you know, our focus on, okay, going and actually really leaning in on analytics, whatever have you. And fortunately, like my co-founder and I, the art, the enterprise business of stack was, was the first time we had ever done enterprise primarily like things to the company we did before was B2C. Like we were selling people courses on how to do web development basically. Right. So a lot of the skillset that, you know, I had built up there, I able to pull that back off the shelf, dust it off, sharpen the blade. And, you know, we're doing email marketing, we're doing live streams, you know? So, so that's, it's, it's kind of cool to, you know, be shifting back to some of the, the, the, where we cut our teeth on back in the day.Alessio [01:09:35]: How did you pick the pricing? Because I had to pay.Swyx [01:09:38]: That's fantastic. You want to like slight, slightly like, yeah, you got a bit. It's like,Alessio [01:09:44]: you're running out of tokens, dude. I was like, f**k, I'm running out of tokens. It's like, I don't want to run out of tokens, but there's like five different tiers. Yeah. Right. Which are kind of like token based and capacity based. Yep. How do you kind of reconcile that? And the consumer side where maybe the consumer doesn't even really need to know what a token is, right? Like on that, like your mom probably doesn't really care what an AI token is. How did you structure it to start? How did you come up with that? And then maybe ideas that you have to like improve or like modify that.Eric [01:10:12]: Totally. Yeah. So we, so when we first launched with StackBlitz is like, we were an enterprise play, right? And so when we launched in 2017, I think we tried pricing 2018 or 2019, but like it was free for a long time. And then we had a 9𝑝𝑙𝑎𝑛𝑎𝑛𝑑𝑡ℎ𝑎𝑡𝑤𝑎𝑠𝑗𝑢𝑠𝑡𝑡ℎ𝑒𝑤𝑎𝑦𝑖𝑡𝑤𝑎𝑠.𝐼𝑡𝑤𝑎𝑠,𝑖𝑡𝑤𝑎𝑠𝑘𝑖𝑛𝑑𝑜𝑓𝑙𝑖𝑘𝑒𝑜𝑢𝑟,𝑜𝑢𝑟𝑑𝑜𝑙𝑙𝑎𝑟50ℎ𝑜𝑡𝑑𝑜𝑔𝑎𝑡𝐶𝑜𝑠𝑡𝑐𝑜.𝐼𝑡′𝑠𝑘𝑖𝑛𝑑𝑜𝑓𝑙𝑖𝑘𝑒𝑡ℎ𝑖𝑠,𝑡ℎ𝑖𝑠,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑗𝑢𝑠𝑡𝑙𝑜𝑤𝑝𝑟𝑖𝑐𝑒,𝑗𝑢𝑠𝑡,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑖𝑡,𝑖𝑡𝑤𝑎𝑠𝑛′𝑡𝑡ℎ𝑒𝑝𝑟𝑖𝑚𝑎𝑟𝑦𝑟𝑒𝑣𝑑𝑟𝑖𝑣𝑒𝑟𝑎𝑛𝑑𝑤𝑒𝑗𝑢𝑠𝑡𝑤𝑎𝑛𝑡𝑒𝑑𝑡𝑜,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑠𝑎𝑦,𝐻𝑒𝑦,𝑝𝑎𝑦𝑓𝑜𝑟𝑠𝑜𝑚𝑒𝑚𝑜𝑟𝑒𝑠𝑡𝑜𝑟𝑎𝑔𝑒𝑎𝑛𝑑𝑝𝑟𝑖𝑣𝑎𝑡𝑒𝑝𝑟𝑜𝑗𝑒𝑐𝑡𝑠𝑜𝑟𝑤ℎ𝑎𝑡𝑒𝑣𝑒𝑟.𝐴𝑛𝑑𝑠𝑜𝑤𝑒𝑤𝑒𝑛𝑡𝑡𝑜𝑙𝑎𝑢𝑛𝑐ℎ𝑏𝑜𝑙𝑡𝑎𝑔𝑎𝑖𝑛,𝑙𝑖𝑘𝑒𝑜𝑢𝑟𝑒𝑥𝑝𝑒𝑐𝑡𝑎𝑡𝑖𝑜𝑛𝑤𝑎𝑠,𝐻𝑒𝑦,𝑤𝑒′𝑙𝑙𝑝𝑟𝑜𝑏𝑎𝑏𝑙𝑦𝑔𝑒𝑡𝑎𝑔𝑜𝑜𝑑𝑛𝑢𝑚𝑏𝑒𝑟𝑜𝑓𝑝𝑒𝑜𝑝𝑙𝑒𝑡ℎ𝑎𝑡′𝑙𝑙𝑠𝑖𝑔𝑛𝑢𝑝𝑎𝑛𝑑𝑏𝑒𝑒𝑥𝑐𝑖𝑡𝑒𝑑𝑎𝑏𝑜𝑢𝑡𝑖𝑡.𝐴𝑛𝑑𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑤𝑒′𝑟𝑒𝑛𝑜𝑡𝑡𝑜𝑜𝑐𝑜𝑛𝑐𝑒𝑟𝑛𝑒𝑑,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑤𝑒′𝑟𝑒𝑗𝑢𝑠𝑡,𝑤𝑒′𝑟𝑒𝑗𝑢𝑠𝑡𝑛𝑜𝑡,𝑤𝑒𝑤𝑒𝑟𝑒𝑢𝑛𝑝𝑟𝑒𝑝𝑎𝑟𝑒𝑑𝑓𝑜𝑟𝑡ℎ𝑒𝑡𝑠𝑢𝑛𝑎𝑚𝑖𝑡ℎ𝑎𝑡ℎ𝑖𝑡.𝐴𝑛𝑑𝑠𝑜𝑎𝑓𝑡𝑒𝑟𝑔𝑜𝑖𝑛𝑔𝑜𝑛𝑙𝑖𝑛𝑒𝑡ℎ𝑒𝑓𝑖𝑟𝑠𝑡𝑤𝑒𝑒𝑘,𝑤𝑒𝑤𝑒𝑟𝑒𝑙𝑖𝑘𝑒,𝑤𝑜𝑤,𝑡ℎ𝑖𝑠𝑖𝑠𝑐𝑜𝑜𝑙.𝑇ℎ𝑒𝑟𝑒′𝑠𝑎,𝐼𝑚𝑒𝑎𝑛,𝑖𝑡𝑗𝑢𝑠𝑡𝑘𝑒𝑝𝑡𝑔𝑟𝑜𝑤𝑖𝑛𝑔.𝐴𝑛𝑑𝑡ℎ𝑒𝑛𝑜𝑛𝑐𝑒𝑤𝑒ℎ𝑖𝑡𝑤𝑒𝑒𝑘𝑡𝑤𝑜,𝐼𝑚𝑒𝑎𝑛,𝑤𝑒𝑤𝑒𝑟𝑒𝑗𝑢𝑠𝑡𝑛𝑖𝑛𝑒𝑏𝑢𝑐𝑘𝑠𝑤𝑎𝑠,𝐼𝑚𝑒𝑎𝑛,𝑖𝑡′𝑠𝑙𝑖𝑘𝑒𝑡ℎ𝑒𝑐ℎ𝑒𝑎𝑝𝑒𝑠𝑡𝐴𝐼𝑐𝑜𝑑𝑖𝑛𝑔𝑡ℎ𝑖𝑛𝑔𝑦𝑜𝑢𝑐𝑎𝑛𝑔𝑒𝑡𝑚𝑎𝑦𝑏𝑒𝑜𝑡ℎ𝑒𝑟𝑡ℎ𝑎𝑛𝑐𝑜𝑝𝑖𝑙𝑜𝑡,𝑏𝑢𝑡𝑙𝑖𝑘𝑒𝑤𝑒𝑤𝑒𝑟𝑒𝑜𝑣𝑒𝑟𝑟𝑢𝑛𝑏𝑦𝑠𝑢𝑝𝑝𝑜𝑟𝑡𝑡𝑖𝑐𝑘𝑒𝑡𝑠.𝐴𝑛𝑑𝐼𝑗𝑢𝑠𝑡,𝑎𝑛𝑑𝑗𝑢𝑠𝑡𝑡ℎ𝑒𝑠ℎ𝑒𝑒𝑟𝑣𝑜𝑙𝑢𝑚𝑒𝑜𝑓𝑝𝑒𝑜𝑝𝑙𝑒𝑐𝑜𝑚𝑖𝑛𝑔𝑖𝑛𝑎𝑛𝑑𝑖𝑡𝑗𝑢𝑠𝑡𝑙𝑎𝑤𝑠𝑜𝑓𝑠𝑢𝑝𝑝𝑙𝑦𝑎𝑛𝑑𝑑𝑒𝑚𝑎𝑛𝑑.𝑊𝑒𝑤𝑒𝑟𝑒𝑙𝑖𝑘𝑒,𝑜𝑘𝑎𝑦,𝑡ℎ𝑖𝑠𝑖𝑠𝑛′𝑡,𝑡ℎ𝑒𝑟𝑒′𝑠𝑛𝑜𝑤𝑎𝑦𝑤𝑒𝑐𝑎𝑛𝑠𝑐𝑎𝑙𝑒𝑡𝑜𝑚𝑒𝑒𝑡𝑡ℎ𝑖𝑠.𝐴𝑙𝑠𝑜𝑡ℎ𝑒𝑝𝑒𝑜𝑝𝑙𝑒𝑐𝑜𝑚𝑖𝑛𝑔𝑖𝑛𝑎𝑟𝑒𝑏𝑢𝑟𝑛𝑖𝑛𝑔𝑡ℎ𝑟𝑜𝑢𝑔ℎ𝑡ℎ𝑒𝑖𝑟𝑡𝑜𝑘𝑒𝑛𝑠𝑎𝑛𝑑𝑡ℎ𝑒𝑟𝑒′𝑠𝑛𝑜𝑤𝑎𝑦𝑡𝑜𝑎𝑐𝑡𝑢𝑎𝑙𝑙𝑦𝑙𝑖𝑘𝑒𝑏𝑢𝑦𝑚𝑜𝑟𝑒𝑜𝑓𝑡ℎ𝑒𝑠𝑒𝑡ℎ𝑖𝑛𝑔𝑠.𝐴𝑛𝑑𝑛𝑖𝑛𝑒𝑏𝑢𝑐𝑘𝑠𝑖𝑠𝑗𝑢𝑠𝑡,𝑦𝑜𝑢𝑐𝑎𝑛′𝑡𝑔𝑒𝑡𝑡ℎ𝑎𝑡𝑚𝑢𝑐ℎ𝑖𝑛𝑓𝑒𝑟𝑒𝑛𝑐𝑒𝑜𝑢𝑡𝑜𝑓𝑡ℎ𝑎𝑡.𝐴𝑛𝑑𝑠𝑜𝑡ℎ𝑒,ℎ𝑒𝑟𝑒′𝑠𝑡ℎ𝑒𝑜𝑡ℎ𝑒𝑟𝑡ℎ𝑖𝑛𝑔𝑡ℎ𝑎𝑡′𝑠𝑖𝑛𝑡𝑒𝑟𝑒𝑠𝑡𝑖𝑛𝑔𝑎𝑏𝑜𝑢𝑡𝑏𝑜𝑙𝑡𝑐𝑜𝑚𝑝𝑎𝑟𝑒𝑑𝑡𝑜𝑙𝑖𝑘𝑒𝑠𝑜𝑚𝑒𝑡ℎ𝑖𝑛𝑔𝑙𝑖𝑘𝑒𝑐𝑜𝑝𝑖𝑙𝑜𝑡𝑜𝑟𝑤ℎ𝑎𝑡𝑒𝑣𝑒𝑟.𝐴𝑛𝑑𝑡ℎ𝑖𝑠𝑘𝑖𝑛𝑑𝑜𝑓𝑡𝑖𝑒𝑑𝑡ℎ𝑖𝑠,𝑠𝑜𝑟𝑟𝑦,𝑎𝑙𝑖𝑡𝑡𝑙𝑒𝑏𝑖𝑡𝑜𝑓𝑎𝑟𝑜𝑢𝑛𝑑𝑎𝑏𝑜𝑢𝑡𝑤𝑎𝑦𝑡𝑜𝑎𝑛𝑠𝑤𝑒𝑟𝑦𝑜𝑢𝑟𝑞𝑢𝑒𝑠𝑡𝑖𝑜𝑛.𝐵𝑢𝑡𝑏𝑎𝑠𝑖𝑐𝑎𝑙𝑙𝑦𝑤ℎ𝑎𝑡𝑤𝑒𝑒𝑛𝑑𝑒𝑑𝑢𝑝𝑎𝑡𝑡ℎ𝑎𝑡𝑚𝑜𝑚𝑒𝑛𝑡,𝑤𝑒𝑒𝑛𝑑𝑒𝑑𝑢𝑝𝑟𝑒𝑎𝑙𝑖𝑧𝑖𝑛𝑔𝑖𝑠𝑡ℎ𝑎𝑡𝑤ℎ𝑒𝑛𝑦𝑜𝑢𝑢𝑠𝑒𝑐𝑜𝑝𝑖𝑙𝑜𝑡,𝑤ℎ𝑎𝑡𝑖𝑡′𝑠𝑠𝑒𝑛𝑑𝑖𝑛𝑔𝑢𝑝,𝑖𝑡𝑑𝑜𝑒𝑠𝑛′𝑡𝑝𝑟𝑜𝑣𝑖𝑑𝑒𝑎𝑙𝑜𝑡𝑜𝑓𝑐𝑜𝑛𝑡𝑒𝑥𝑡𝑜𝑓𝑦𝑜𝑢𝑟𝑐𝑜𝑑𝑒𝑏𝑎𝑠𝑒.𝑇ℎ𝑒𝑦𝑡𝑟𝑦𝑎𝑛𝑑𝑟𝑒𝑑𝑢𝑐𝑒𝑡ℎ𝑒𝑎𝑚𝑜𝑢𝑛𝑡𝑜𝑓𝑐𝑜𝑛𝑡𝑒𝑥𝑡𝑎𝑠𝑚𝑢𝑐ℎ𝑎𝑠𝑡ℎ𝑒𝑦𝑐𝑎𝑛.𝐴𝑛𝑑𝐼𝑡ℎ𝑖𝑛𝑘,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑡ℎ𝑒𝑜𝑟𝑖𝑔𝑖𝑛𝑠𝑜𝑓𝑡ℎ𝑖𝑠𝑠𝑡𝑢𝑓𝑓𝑖𝑠𝑡ℎ𝑒𝑦,𝑒𝑣𝑒𝑟𝑦𝑜𝑛𝑒𝑘𝑖𝑛𝑑𝑜𝑓𝑤𝑎𝑛𝑡𝑠𝑡ℎ𝑖𝑠𝑙𝑖𝑘𝑒𝑙𝑜𝑤𝑝𝑟𝑖𝑐𝑒𝑝𝑜𝑖𝑛𝑡𝑤ℎ𝑒𝑟𝑒𝑖𝑡′𝑠𝑙𝑖𝑘𝑒𝑎𝑙𝑙𝑦𝑜𝑢𝑐𝑎𝑛𝑒𝑎𝑡.𝑆𝑜𝑖𝑡𝑗𝑢𝑠𝑡𝑘𝑖𝑛𝑑𝑜𝑓,𝑡ℎ𝑎𝑡𝑘𝑖𝑛𝑑𝑜𝑓𝑓𝑒𝑒𝑙𝑠𝑙𝑖𝑘𝑒,𝑐𝑎𝑢𝑠𝑒𝑖𝑡′𝑠𝑙𝑖𝑘𝑒,𝑖𝑡𝑎𝑙𝑚𝑜𝑠𝑡𝑙𝑖𝑘𝑒𝑁𝑒𝑡𝑓𝑙𝑖𝑥,𝑖𝑡′𝑠𝑙𝑖𝑘𝑒,𝐼′𝑙𝑙𝑝𝑎𝑦𝑎𝑡ℎ𝑖𝑛𝑔.𝐴𝑛𝑑𝑡ℎ𝑒𝑛𝐼𝑐𝑎𝑛𝑗𝑢𝑠𝑡𝑑𝑜𝑎𝑠𝑚𝑢𝑐ℎ𝑜𝑓𝑡ℎ𝑒𝑚𝑜𝑣𝑖𝑒𝑤𝑎𝑡𝑐ℎ𝑖𝑛𝑔𝑎𝑠𝐼𝑤𝑎𝑛𝑡.𝐴𝑛𝑑𝐼𝑡ℎ𝑖𝑛𝑘,𝐼𝑡ℎ𝑖𝑛𝑘𝑡ℎ𝑎𝑡,𝑡ℎ𝑎𝑡𝑘𝑖𝑛𝑑𝑜𝑓𝑚𝑒𝑛𝑡𝑎𝑙𝑖𝑡𝑦,𝑤ℎ𝑒𝑛𝑡ℎ𝑒𝑠𝑒𝑓𝑖𝑟𝑠𝑡𝐴𝐼𝑝𝑟𝑜𝑑𝑢𝑐𝑡𝑠𝑐𝑎𝑚𝑒,𝑖𝑡𝑘𝑖𝑛𝑑𝑜𝑓𝑚𝑎𝑘𝑒𝑠𝑠𝑒𝑛𝑠𝑒.𝑇ℎ𝑒𝑦′𝑟𝑒𝑙𝑖𝑘𝑒,𝑜𝑘𝑎𝑦,𝑤𝑒𝑙𝑙𝑤𝑒,𝑤𝑒𝑑𝑜𝑛′𝑡𝑤𝑎𝑛𝑡𝑡𝑜𝑚𝑒𝑡𝑒𝑟𝑖𝑡.𝐶𝑎𝑢𝑠𝑒𝑡ℎ𝑎𝑡𝑑𝑜𝑒𝑠𝑛′𝑡𝑓𝑒𝑒𝑙𝑔𝑜𝑜𝑑.𝑅𝑖𝑔ℎ𝑡.𝐵𝑢𝑡𝑡ℎ𝑒𝑝𝑟𝑜𝑏𝑙𝑒𝑚𝑖𝑠𝑡ℎ𝑎𝑡𝑡ℎ𝑒𝑛𝑡ℎ𝑒𝑦′𝑟𝑒𝑖𝑛𝑐𝑒𝑛𝑡𝑖𝑣𝑖𝑧𝑒𝑑𝑡𝑜𝑛𝑜𝑡ℎ𝑎𝑣𝑒𝑖𝑡𝑏𝑒𝑎𝑏𝑙𝑒𝑡𝑜𝑘𝑒𝑒𝑝𝑡ℎ𝑒𝑚𝑜𝑟𝑒𝑐𝑜𝑛𝑡𝑒𝑥𝑡𝑦𝑜𝑢𝑔𝑖𝑣𝑒𝑖𝑡,𝑡ℎ𝑒𝑚𝑜𝑟𝑒𝑖𝑡𝑐𝑎𝑛𝑑𝑜.𝐴𝑛𝑑𝑡ℎ𝑎𝑡′𝑠𝑡ℎ𝑒𝑚𝑎𝑔𝑖𝑐𝑜𝑓𝑤ℎ𝑎𝑡𝑤𝑒′𝑟𝑒𝑑𝑜𝑖𝑛𝑔𝑤𝑖𝑡ℎ𝑏𝑜𝑙𝑑𝑖𝑠𝑤𝑒′𝑟𝑒𝑔𝑖𝑣𝑖𝑛𝑔𝑖𝑡𝑎𝑙𝑙𝑡ℎ𝑒𝑐𝑜𝑛𝑡𝑒𝑥𝑡𝑤𝑒𝑝𝑜𝑠𝑠𝑖𝑏𝑙𝑦𝑐𝑎𝑛.𝐴𝑛𝑑𝑡ℎ𝑎𝑡′𝑠𝑤ℎ𝑦𝑦𝑜𝑢𝑐𝑎𝑛𝑔𝑜𝑡𝑜𝑖𝑡𝑎𝑛𝑑𝑠𝑎𝑦,𝑚𝑎𝑘𝑒𝑚𝑒𝑎𝑛𝑅𝑆𝑉𝑃𝑠𝑖𝑡𝑒.𝐴𝑛𝑑𝑖𝑡𝑑𝑜𝑒𝑠𝑛′𝑡𝑏𝑒𝑐𝑎𝑢𝑠𝑒𝑖𝑡ℎ𝑎𝑠𝑐𝑜𝑛𝑡𝑒𝑥𝑡,𝑡ℎ𝑒𝑒𝑛𝑡𝑖𝑟𝑒𝑠𝑡𝑎𝑡𝑒𝑜𝑓𝑡ℎ𝑒𝑎𝑝𝑝𝑙𝑖𝑐𝑎𝑡𝑖𝑜𝑛,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑒𝑡𝑐𝑒𝑡𝑒𝑟𝑎,𝑒𝑡𝑐𝑒𝑡𝑒𝑟𝑎.𝐴𝑛𝑑𝑡ℎ𝑎𝑡′𝑠𝑤ℎ𝑎𝑡𝑚𝑎𝑘𝑒𝑠𝑖𝑡𝑠𝑜𝑎𝑐𝑐𝑢𝑟𝑎𝑡𝑒.𝑉𝑒𝑟𝑠𝑢𝑠𝑖𝑓𝑦𝑜𝑢𝑔𝑜𝑡𝑜𝑐𝑜−𝑝𝑖𝑙𝑜𝑡𝑎𝑛𝑑𝑠𝑎𝑦𝑡ℎ𝑎𝑡𝑖𝑡,𝑡ℎ𝑒𝑟𝑒′𝑙𝑙𝑏𝑒,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑖𝑡𝑚𝑖𝑔ℎ𝑡𝑝𝑢𝑛𝑐ℎ𝑜𝑢𝑡𝑎𝑟𝑒𝑎𝑐𝑡𝑐𝑜𝑚𝑝𝑜𝑛𝑒𝑛𝑡.𝑇ℎ𝑎𝑡′𝑠𝑡ℎ𝑒𝑏𝑢𝑡𝑡𝑜𝑛𝑡𝑜𝑐𝑟𝑒𝑎𝑡𝑒𝑡ℎ𝑒𝑡ℎ𝑖𝑛𝑔,𝑏𝑢𝑡𝑛𝑜𝑡𝑎𝑐𝑡𝑢𝑎𝑙𝑙𝑦𝑚𝑜𝑟𝑒𝑡ℎ𝑎𝑛𝑡ℎ𝑎𝑡.𝑆𝑜𝑎𝑛𝑦𝑤𝑎𝑦,𝑠𝑜,𝑢𝑚,𝑦𝑜𝑢𝑘𝑛𝑜𝑤,𝑎𝑛𝑑𝑎𝑡𝑡ℎ𝑒𝑡𝑖𝑚𝑒𝑤ℎ𝑒𝑛𝑝𝑒𝑜𝑝𝑙𝑒ℎ𝑎𝑣𝑒𝑏𝑜𝑢𝑔ℎ𝑡𝑡ℎ𝑒9planandthatwasjustthewayitwas.Itwas,itwaskindoflikeour,ourdollar50hotdogatCostco.It′skindoflikethis,this,youknow,justlowprice,just,youknow,it,itwasn′ttheprimaryrevdriverandwejustwantedto,youknow,say,Hey,payforsomemorestorageandprivateprojectsorwhatever.Andsowewenttolaunchboltagain,likeourexpectationwas,Hey,we′llprobablygetagoodnumberofpeoplethat′llsignupandbeexcitedaboutit.Andyouknow,we′renottooconcerned,youknow,we′rejust,we′rejustnot,wewereunpreparedforthetsunamithathit.Andsoaftergoingonlinethefirstweek,wewerelike,wow,thisiscool.There′sa,Imean,itjustkeptgrowing.Andthenoncewehitweektwo,Imean,wewerejustninebuckswas,Imean,it′slikethecheapestAIcodingthingyoucangetmaybeotherthancopilot,butlikewewereoverrunbysupporttickets.AndIjust,andjustthesheervolumeofpeoplecominginanditjustlawsofsupplyanddemand.Wewerelike,okay,thisisn′t,there′snowaywecanscaletomeetthis.Alsothepeoplecominginareburningthroughtheirtokensandthere′snowaytoactuallylikebuymoreofthesethings.Andninebucksisjust,youcan′tgetthatmuchinferenceoutofthat.Andsothe,here′stheotherthingthat′sinterestingaboutboltcomparedtolikesomethinglikecopilotorwhatever.Andthiskindoftiedthis,sorry,alittlebitofaroundaboutwaytoansweryourquestion.Butbasicallywhatweendedupatthatmoment,weendeduprealizingisthatwhenyouusecopilot,whatit′ssendingup,itdoesn′tprovidealotofcontextofyourcodebase.Theytryandreducetheamountofcontextasmuchastheycan.AndIthink,youknow,theoriginsofthisstuffisthey,everyonekindofwantsthislikelowpricepointwhereit′slikeallyoucaneat.Soitjustkindof,thatkindoffeelslike,causeit′slike,italmostlikeNetflix,it′slike,I′llpayathing.AndthenIcanjustdoasmuchofthemoviewatchingasIwant.AndIthink,Ithinkthat,thatkindofmentality,whenthesefirstAIproductscame,itkindofmakessense.They′relike,okay,wellwe,wedon′twanttometerit.Causethatdoesn′tfeelgood.Right.Buttheproblemisthatthenthey′reincentivizedtonothaveitbeabletokeepthemorecontextyougiveit,themoreitcando.Andthat′sthemagicofwhatwe′redoingwithboldiswe′regivingitallthecontextwepossiblycan.Andthat′swhyyoucangotoitandsay,makemeanRSVPsite.Anditdoesn′tbecauseithascontext,theentirestateoftheapplication,youknow,etcetera,etcetera.Andthat′swhatmakesitsoaccurate.Versusifyougotoco−pilotandsaythatit,there′llbe,youknow,itmightpunchoutareactcomponent.That′sthebuttontocreatethething,butnotactuallymorethanthat.Soanyway,so,um,youknow,andatthetimewhenpeoplehaveboughtthe9 plan, they were like, I want to give you more money. I want you to buy more tokens. How do I do that? And so our team scrambled that weekend, we just turned it around and just, you know, we said, okay, well, what do we think is reasonable? And we said, okay, so let's go, you immediately double the prices of the, of the base tier, because it's just not enough what people are getting on for nine bucks. So that'll be, that seems reasonable. It's kind of in line with everyone else. And then we added 50, 100 and $200 plans. Cause we're like, that should be enough. And so, yeah, so that, that's kind of the origins of it. And, and, um, it was, it was people that use it, fall in love with that and they want to use more of it. And the problem is the inference is expensive. And so we're not actually taking, you know, to date on the, on the revenue we've done, we have not really taken a margin at all on this stuff. Cause we're just trying to put all the value back into the folks that are there using the tool and just getting the maximum amount of value out of it. But it's really key to the kind of the magic of the experience. And so the other, the other thing kind of worth mentioning is there's kind of the ARR number, but then we, you can also buy additional tokens, you know, just with usage-based billing effectively. And that's accounting for an additional 20, 30% of, of revenue that's coming to the company. People are actually using this to do their jobs. Like, you think, think about a web development agency before this thing, they're going in using Figma to make a design. They have to pay the designer. They have to like punch that out into code, kind of man. And maybe like co-pilot can help a little bit with punching out this thing that they're coming to this thing. And there's just wild stories online where it's like guy bake, local bakeries, like we need a website. He's like, okay, well, I'm going to charge you a thousand bucks. They're like, okay, that sounds great. Reasonable price. 30 minutes later, he's like, here's a deploy preview of your thing. How does that look? They're like, wow, holy crap. I'm not giving you a thousand bucks. But they did, they were, they were, they were like, this usually takes months, you know? So some of the biggest power users are people that build websites for a living because this is the, the alpha on this is insane.Alessio [01:14:26]: That's almost like the gap, right? It's like, it used to be that if I ask you before this to do a website and in 30 minutes you return to me and you give me something, I'm like, you know, you're probably just copying something else you've done before, you know, versus now it's almost like, it doesn't really matter how much time it takes you because everybody's going to be so fast with these things. It's more like the value. And that's why when you're pricing TRL, it was almost like, there's only really going to be like either 20𝑎𝑚𝑜𝑛𝑡ℎ𝑢𝑠𝑒𝑟𝑠𝑜𝑟𝑙𝑖𝑘𝑒𝑎𝑡ℎ𝑜𝑢𝑠𝑎𝑛𝑑𝑑𝑜𝑙𝑙𝑎𝑟𝑠𝑎𝑚𝑜𝑛𝑡ℎ𝑢𝑠𝑒𝑟𝑠.𝑌𝑜𝑢𝑘𝑛𝑜𝑤,𝑖𝑡′𝑠𝑎𝑙𝑚𝑜𝑠𝑡𝑙𝑖𝑘𝑒𝑤ℎ𝑜′𝑠𝑔𝑜𝑖𝑛𝑔𝑡𝑜𝑢𝑠𝑒𝑡ℎ𝑒20amonthusersorlikeathousanddollarsamonthusers.Youknow,it′salmostlikewho′sgoingtousethe50 a month because it's kind of like in between, between being infrequent user and being like a power user, you know? So yeah, it makes sense that you have like a big part of like on demandEric [01:15:05]: on top of that. Yeah. And on the 50, there's actually a lot of people on the one. I think it's because it's like enough to actually like for developers are using this to just kind of like punch out components or designs or whatever, kind of gets them enough for, you know, kind of in a given month or whatever. And so it's been interesting to just kind of see the, the, you know, the, the upgrades that happen, but what's been kind of cool about the product is it's, and again, I think this is kind of novel and this is, you know, us being maybe a little more transparent than we should be or something, but like, I suspect we're just, I think we're going to see a lot more of this because we're hitting an inflection point coming back to the co-pilot thing. Part of the problem before is that it didn't matter if you provided more context, the models just weren't good enough to know what to even do with it. That's not the case now. You know, just one, one, you know, story of like one of the first people, one of the power, first power users that adopted Bolt was this gal in Thailand who's a PM at a software banking company. And she had an idea for this app called viralhooks.ai, which is basically, it's a tool that if you want to make viral TikToks and stuff, it's like, what's the hook of the video to make people watch. Right. And so basically she, you know, you can go and like, see, it goes and extracts hooks from other people's videos and helps you with like, you know, AI to write your own. And she had originally put the week before Bolt launched, she put that on Upwork and you know, some, I think a developer in like Ukraine had quoted her, you know, 5,000.𝐴𝑛𝑑𝑖𝑡′𝑠𝑔𝑜𝑖𝑛𝑔𝑡𝑜𝑡𝑎𝑘𝑒𝑙𝑖𝑘𝑒𝑡ℎ𝑟𝑒𝑒𝑚𝑜𝑛𝑡ℎ𝑠𝑜𝑟𝑠𝑜𝑚𝑒𝑡ℎ𝑖𝑛𝑔𝑙𝑖𝑘𝑒𝑡ℎ𝑎𝑡.𝑅𝑒𝑎𝑠𝑜𝑛𝑎𝑏𝑙𝑒𝑡𝑖𝑚𝑒𝑓𝑟𝑎𝑚𝑒,𝑟𝑖𝑔ℎ𝑡.𝐹𝑜𝑟𝑎𝑛𝑎𝑝𝑝𝑙𝑖𝑘𝑒𝑡ℎ𝑎𝑡,𝑟𝑒𝑎𝑠𝑜𝑛𝑎𝑏𝑙𝑒𝑝𝑟𝑖𝑐𝑒.𝑇ℎ𝑒𝑤𝑒𝑒𝑘𝑎𝑓𝑡𝑒𝑟𝑡ℎ𝑎𝑡𝐵𝑜𝑙𝑡𝑐𝑎𝑚𝑒𝑜𝑢𝑡,𝑠ℎ𝑒𝑏𝑜𝑢𝑔ℎ𝑡𝑡ℎ𝑒5,000.Andit′sgoingtotakelikethreemonthsorsomethinglikethat.Reasonabletimeframe,right.Foranapplikethat,reasonableprice.TheweekafterthatBoltcameout,sheboughtthe50 plan and she had the app built within a week or two. And so you're talking about like, that's it. And it's beautiful. She did an incredible job. Right. And so the numbers are wild. 5,000,𝑡ℎ𝑟𝑒𝑒𝑚𝑜𝑛𝑡ℎ𝑠𝑡𝑜5,000,threemonthsto50 and like a week. Yeah. You got to charge more. So it's, it's kind of like, so there's, there's people like when we've had a lot of people go, this pricing is insane. And we're like, well, we're not even taking really a margin at the moment on it, you know, but also, but when you, when you compare that to the price of actually going and building the cost of building quality software today, anyone who knows the price of building quality software, the alpha is obvious, right? It's a 99% cost production and five X faster, you know, delivery time, you know? So anyway, so that's, I think we're one of the first products that have actually come out kind of proving that, you know, in, in, in a revenue way to kind of underscore the point, as you can imagine, we've had, you know, kind of venture capital firms kind of reach out and kind of, you know, curious to kind of, you know, what we're up to or whatever. And so, you know, one of the most, you know, there's kind of one of the, the most notable ones or whatever reached out. So we kind of sent them, you know, you know, kind of our numbers. Actually it was the investor update, Sean, that, that I think you, you know, the, you know, the one you saw kind of gave him a snapshot of it. And they one of their analysts accidentally replied all on what we had sent them and with, with the analysis. And so on this part there, you know, one of the things they said was we haven't seen anything that's kind of eyeopening to see people going to $200 tier on this sort of thing. Haven't seen anything else like that in the space. Cause I think this is very new because of the new model capabilities, right? Where people, you know, it makes sense. Like you're willing to pay more money for this stuff. So. This is something I've talked about before in terms of matchingSwyx [01:18:11]: the dollar amount of spend to the capabilities of the AIs. The chart that I published in the past was, you know, OpenAI has like five levels of AGI-ness and level, level one is sort of like a chatbots, level two is reasoning, level three is agents, four is organizations, five is some, something super, super human. I don't remember what the exact levels are, but each, you can sort of each match each of them with like tiers. Like 20𝑖𝑠𝑙𝑖𝑘𝑒𝑡ℎ𝑒𝑐ℎ𝑎𝑡𝐺𝐵𝑇𝑡𝑖𝑒𝑟.20islikethechatGBTtier.200 is where you're at. 2,000𝑖𝑠ℎ𝑖𝑔ℎ𝑒𝑟,2,000ishigher,20,000, $200,000, right? Like you can see levels where it makes sense. I think BrightWave is also there, by the way. Like I don't know what BrightWave charges, but it's higher, right? Than a chatGBT. And like, you have to deliver more value for that, but you, you can do it now. Yep. So then why not? Everyone should do it.Eric [01:18:58]: I think we're going to see a lot more of this. I think we're going to see, I think, you know, for AI, Cogen specifically, this is the first moment where I think that there's been that moment where it goes from zero to one, where it's like, yep. The price point, you know, the value, the value is so, like what you can get out of these things is so much higher than it was, you know, three, six months ago that I think we're going to see, I think we're going to see a lot more of this. Like we might, you know, Bolt is, I think one of the first things, but yeah, I mean, it's just, to me, it's inevitable that we're going to see many more things kind of leveraging this, this sort of use case and the amount of efficiency you can get out of usingAlessio [01:19:38]: these systems. Right. So yeah. Yeah. Yeah. Because I mean, the Bolt arbitrage would be quote the price based on the query, you know, you're selling high value tokens. Yeah. It's like, Hey, it's like your mom is like, you wouldn't charge your mom $2,000 to tell her stories, but like, you know, this person doing an app and like a product on it. Yeah. You got to pay more, you know, but it's hard right now. I understand. It's like, it's really hard to figure out how much you can push it, how much value the person will get outSwyx [01:20:04]: of the thing. Yeah. So I want to riff a little bit on like stuff like this, right? I think you nailed a lot with the design system. You know, one of the differences between open source Bolt and the one that you have is actually like you, you spend a lot of time on the design system. I think, right. Most things just look great when they come out, but I think there's also a whole backend portion that they need. Was that a challenge? Is there anything that you sort of like figuring out that you want to riff on? Yeah. So I think one of the main things,Eric [01:20:28]: I think you hit the nail on the head, which is, you know, kind of going into putting Bolt online. We originally, again, we've been selling to developers and so we were kind of like, this is a tool for prototyping and they'll download their code. But we ended up finding in the early user testing was how important the deployment story was and how, and this is something you said to me specifically, you're like backend, this needs to like backend needs to be part of this, like logging in, like off just to triple confirm you're dead right. That has been the absolute number one thing that folks coming to Bolt, you know, are looking to do is build a real app with a backend, with billing. And so one of this guy, Mauricio, he's one of our power users. He's like, there's three things that like every app that I'll ever want to build in Bolt, any of these other people in this community, you know, three things, a database, auth, and payments. So those three things, right. So that's- Admin dashboard. We can do that pretty decently, pretty decently. As in every database needs a WP admin. Yes. Yes. Correct. Totally. Totally. And so, yeah, today I think like viral hooks, for example, I think she's using Firebase for auth and database and that sort of thing. You know, so I think Firebase and Superbase, those are the two things that, that just work incredibly well. And so that's actually the point where we're at now, where, you know, right now it's, you know, folks have to still, you know, kind of go to Superbase, manually spin up a thing, come back to Bolt, but the thing that, you know, it's like that sort of processing thing with Firebase, each of those products are going to have their own little quirks that you have to, there's like kind of steps, right. And so- Boltbase. Yeah. Boltbase. Yeah. I think, yeah, I think initially we're like, okay, there should just be a way to like, for Bolt to just go and spin up these things on their behalf and just, and just, you know, both of them have APIs to do so. I'll go even further, like have like pre-warmSwyx [01:22:12]: instances that you just assign, like it's already spun up, right. So it's, so it's like kind of serverless feeling, even as like, not really, but like yeah, just pre-warm and then just kind of assign it when, whenever someone like- That's a really great point. Yeah. Just keep, keep oneEric [01:22:26]: Firebase in the hopper, basically. One, 10, 100, I don't know. More generally, this is what I feltSwyx [01:22:32]: that I wanted to do on our call, which is like, when you have PMF, yes, you want to invest some time in like understanding your customers and do a data analytics and like tighten, tighten things up in general, like tighten up the pricing, tighten up the cost and all that. But then like, you also have to work on like, what is next, like the next level and growth, like you can still inflect. Yeah. I don't know what that is, but you know, I wanted to, I wanted to keep pushing you and I don't know if I did, mostly because I was serving as facilitator on that call. That's what I think. Like, I think you got to still keep pushing the frontier and I don't know what it, what it is, but like, you know, I want to hear what you got thinking about.Eric [01:23:07]: I think there's, you know, we've addressed just a lot of the low hanging P0 stuff then, and we've actually seen, we've kind of the, you know, there's, there's key moments where it's just kind of like been going like that, which has been cool. Cause it's like, okay, well we were, we're just getting started. This is just the, this is just the fixing obvious things part. Fundamentally, I think a lot, what a lot of people are coming here to do is just, how can we just make it faster to go from idea to production? And a lot of it is like, I had, when I have to go to Firebase, Superbase, spin something up, run a migrate, you know, like add a table, but it's like the agent can do that, you know, so that stuff should be baked in. Yeah. And same thing with the deployment side. It's like right now it's going to Netlify, but people have to create a Netlify account and go and do that. Right. And so I think one of the things we're going to end up doing here is just having the hosting be baked in. And so I've been talking with Matt over at Netlify about this, cause they actually have a way to kind of white label stuff. And so, cause people are, they're just going to make a website, you know? And so it's I mean, that means also you take over domain registration. Can you imagine, right? Like a couple of months from now, you come to this thing, you're like, I want to make, I want to make an RSVP site. Right. And it's like, great. Do you, you know, do you have a name for it? Or do you want to, you know, a domain? You're like, I don't know a name. It's like, well, here's like 10 options and the.coms are able to look good. Yep. That one does. Okay. We want to buy it. Okay, great. It bought the DNS is pointed at the thing. Should we start building this? Okay. Does this look good? Yep. Okay. Am I okay to push this to prod? Yep. That looks good. You know, like that's without leaving the product.Swyx [01:24:31]: Right. So to me, like it's tomorrow was the first to actually say like you are the new Wix. I never, I personally never thought about it that way. Wix is a $10 billion company where you want to go, you know, cause you still have a choice here. From what we're hearing from the folks usingEric [01:24:43]: the product, I think I don't even think Wix is even able to solve their need, you know? But not to say that we don't want to, you know, that, that what you're saying is now we want, but, but I mean, yeah, like I think we want to solve folks problems. And I think that there's a huge gap in the market of being able to build, you know, kind of more sophisticated, high quality software like websites in a way that for someone who's a non-engineer. And so I think there's a huge market for that. And obviously, even if you're trying to build a wedding website, yeah, this is, this is easier and faster. Right. So I love it. I, you know, again, coming to the origins of why Albert, my co-founder and I are doing this is we've always just loved building stuff on the web. It's like this, I, this is the tool from what, even when stack was just the IDE interface to the technology, it's like, this is the thing we wish we had when we were 13 years old, you know? And with Bolt, oh my God, if this is the thing I wish we had when we were 13 years old, I'm so glad that my daughter's going to have this thing, you know? So anyways, yeah, I think it makes me pretty, pretty stoked that people are going to be able to actually build amazing web applications that can do really sophisticated things, you know? So yes, I think the short answer is heck yeah. I mean, yeah, that sort of market and totally right up our alley. One other angle that I wanted to pursue wasSwyx [01:25:53]: also the other languages. You know, you're very JavaScript centric. We've talked about Python forever. Ruby maybe, is that important? You know, like the previous generation of site builders were mostly Ruby shops and some PHP. Do we want to capture that or are we just like, you know, always been on JavaScript and just let JavaScript take over the world? You know, I think, I thinkEric [01:26:14]: we're, we're, we're certainly with great interest interested in other languages and we have like minimal support of Python and some C++ stuff in web container that you can like run or whatever. I think especially with the, with the stuff we're seeing though, it's the languages is kind of ancillary to the, to the, to the thing. Well, there's the ecosystem of like,Swyx [01:26:31]: I want to end up with a code base that I can hire humans on to do the stuff that Bolt cannot do.Eric [01:26:36]: Yeah, true. And I think, I think in that sense, like the, the, the JavaScript Node.js ecosystem is huge and well-established. So it's like, I think it'd be certainly be able to get people to work on this stuff. And I think the only thing that would be missing is it's like, are you building web apps that where a lot of the functionality is only in libraries that are in Python or something. Right. And I think just kind of seeing the applications that are being built here at, you know, I think that'd be like data science and like ML and that sort of thing. And so that's, we're not seeing a lot of that stuff, you know? And then, but I think that's like, we're like kind of a more generic approach is like what Repl.it's doing where they're spinning up real VMs. You can kind of run anything. And I think they started off with like doing Python service. I actually haven't tried their, their, you know, their new agent stuff that's based on.Swyx [01:27:15]: Repl.it agent. Yeah. We're close friends. Repl.it has the database, the sort of live hosting, everything integrated that you're going to want to build. And you're, I think you're on a collision course with them, to be honest.Eric [01:27:29]: We'll see. Cause I'm curious, you're not the first person to say that. I'm curious to see how it shakes out. Cause I think the challenge is focus. You know, when you are, what's kind of the end goal that you're shooting? Yeah, Repl.it's firmly for developers.Swyx [01:27:45]: You're positioning it for non-developers like that. That's legit.Eric [01:27:48]: Yeah. And even getting, even if focusing on a language or an ecosystem as well, because again, the problem is that these things can just break in a million ways. And so part of the, a lot of the work in making the experience better, like how do you get, like how make it, someone get an idea into the fingertips and live on prod, right? There's so much stuff in between there. And a lot of it is just errors that happen and how do you handle those? And a lot of that comes down to having a giant database of common errors that you can maybe even fine tune stuff on at some point, right? So doing that on, on one ecosystem, you can move a lot faster than if you're trying to support a lot of different languages. However, it's a, to the point of, if you're kind of targeting developers, they may not need that level of kind of streamline, you know, thing. I think that's kind of where I see the main divergence is that we are unabashedly focused on this ecosystem of, for building web apps. Got it. Yeah. You support it forever. Yeah. And so I'm very curious to see how, just how it all shakes out. Cause it's, I think what they're doing is actually, I mean, I'm very curious to see what Microsoft does because if anyone is good at giving out VMs, tying it to a coder and putting AI in it, it's Sia. He's got a cloud. He's got VS code. They've got code spaces. They've they're in open AI. Now they've got Anthropic and Copilot. I mean, I must imagine, I must imagine that they're cooking stuff overSwyx [01:29:06]: there, you know? We'll make sure to ask him. We have many friends from Microsoft listening to theAlessio [01:29:11]: pod. So just to wrap, I don't know, is there anything else Bolt related? I just have one personal question before we wrap the pod. Maybe like just advice, like now that you'veSwyx [01:29:20]: been through this journey, right? Advice to your former self. Oh, okay. Yeah. At which point? Advice yourself, like thinking about, there are many founders out there with a business where they're like, they're working really hard at it. It's interesting, but it's not an AI business. Yeah. And you kind of took the plunge to invest in this and it worked out for you. Maybe a lot of people are like, okay, like, you know, this guy got lucky. Obviously there's a little bit of luck in everything, but like, how do you improve your chances? Like, would you say, go for it? Would you say everyone should go for it? How would you advise someone who was in your shoes and thinking about, you know, maybe I should have a second product. Maybe I should take this, this experiment or maybe it doesn't work out. Like what is, what's the calculus here?Eric [01:30:01]: Yeah. We were deeply skeptical going. I remember the conversation you and I had, you know, I was like this, I think there's something here. At that point we had built some amount, but I had waited a long time to give you the call. I said, this is your moment. Well, it was. So I remember specifically at the beginning of the conversation with Sean, he and I sat down at a coffee shop and, and, and SF, and, and so I was kind of giving him the pitch of like, you know, I think we have, I think that I can't remember the exact framing. I said, but it's, it's, it was obvious that Sean had heard a lot of people say this exact thing to him over the past year or two, which is like, Hey man, we've gotten AI play. Like this is our thing plus AI equals this, this could be crazy. And Sean, I get, you gave me this like skeptical look and then, and I was like, I really think so. And kind of here's why. Right. And and I think, I think that's, it's actually, I think it's, that is internally having, being skeptical of just kind of going and jumping on hype trains is, is good. Cause it's like, I think you, you know, your focus and your time and what you're putting your weight into is the most important thing when you're a founder. I think for us, like we actually, again, like I had mentioned at the beginning of this, you know, we had tried bold and didn't see the results and that was like a two week sprint and we rolled it back. Right. This, this isn't viable at this point, but then when, you know, once we, once we saw real tangible results of, you know, some of the new stuff, right. Okay. That, that changes. Thanks. And I think a lot of it is, is two is going and finding that out for yourself and then going and talking to the smartest people, you know, with more domain knowledge on that stuff than you have and going, here's kind of what we found. Does this track? So when Sean and I met and he, and he, and you know, we keep, he and I kind of, he saw it, we talked through it and he said, this is your moment. I specifically remember that. Cause I, I walked away from that and I was like, holy s**t, this, this is it. Like this, you know, like Sean's Sean's at the intersection of web and AI and as like, it, you know, has one of the best perspectives on this stuff of, of anyone I know that put a huge wind in our sales, honestly, of just like, okay, let's, let's go and really, let's go and double down here because you know, we had conviction before, but having someone who's in the space independently kind of verify meant a lot, you know, so it makes me uncomfortable, but thank you. I get it. I mean, and I waited, I waited until I was pretty darn sure it was not going to be a waste of time toAlessio [01:32:12]: cool. Well, that's all I have. Yeah. And then on the personal side, you had a baby in April, you ran an Ironman in October. Now it's November.Swyx [01:32:20]: He did Ironman while launching ball. I was trying to schedule the call for him and he was like, Nope, I'm sorry. I'm swimming. I was like, Hey, I'm on the swimming session. For those who don't know, actually, I did not know. I don't even know the distance of an Ironman. 13 hours. Your time was 12, 12, 12, 12, 15, 12, 15.Eric [01:32:41]: Give me my minutes. No, no, I, it's, it can, it can completely depends on, you know, the course and just the, the, the person or whatever, right. And, but yeah, I mean, it's,Swyx [01:32:51]: it's 2.4 cam open water, 2.4 mile open water swim, a hundred KM, a hundred mile, a hundred KMEric [01:32:58]: cycle. I think it's like, I think it's 112 mile a bike and then marathon. Yeah. Full 26.2 mile marathon. Yeah. It was why. Yeah. And you weren't, you were not like a super endurance athlete before, right? Like let's like make this clear. Yeah. Kind of a wild, a wild thing. So I, you know, back when I did, we, we had our daughter in April and at that time we were, the future of the company was, you know, we're, we're figuring out what are we going to do here at that time. It was, it was pro just prior to bolt kind of getting kicked into, you know, the rebirth of it with the new models and stuff. And so I knew that it was going to be, you know, having, having a child is, you know, if you talk to anyone that's done that you're, you don't have a lot of sleep. It's it's, you know, there's a lot of, you know, to, to, to be a great parent is, is a ton of work. And then also being a startup CEO where there's a lot of uncertainty or whatever the way I've always found, like when I have to go and you kind of knock it out of the park and all aspects of my life is, is going, yeah, just to, to make it all aspects of my life. And so I was, I just won. Yeah. I woke up one day, I was like, all right, I'm going to do an Ironman this year and I burned the ships, bought the, it's cost a thousand bucks to do. These didn't know that. And, you know, just started, I'd never ran a marathon at that point. And so I think it was like 45 or 60 days after that, I ran a marathon. My brother-in-law, he's, that was even more insane two weeks before the marathon. I was like, Hey, you want to run a marathon in two weeks? He's like, sure. And, and just did it with me. He did not an endurance athlete either. Right. But anyway, so yeah, so I was training, ended up getting a coach who's usually go, you're kind of online. He's up in Marin. Great guy was on the U S Olympic team for triathlons. And when I told him, okay, I'm going to, I'm doing Ironman, California in three months, he was like, are you insane? You know, like, what are you, you know, you'd ask for my opinion, but like, I just want you to know, I don't think this is a good idea. I think, you know, like you shouldn't do this, et cetera. And I ended up doing it, you know, I ended up getting it done. And so he was like, okay, like that's pretty bad. But what makes you, what makes you ignore expert advice here? LikeSwyx [01:34:59]: most sane people would be, would be like, okay, I mean, you know what you're doing? Like,Eric [01:35:03]: I'll maybe wait a year. I think, and this is, this is kind of the, and the being a founder, right. It's, it's all about like, if you, like I mentioned earlier, it's like when we talk to people that worked on browser engines, they're like, you can't, you can't build what you're talking about. I think the job of a founder is, is to, is to solicit that advice. And, and what my coach actually said, he was right about certain things. There are certain areas where I was under indexed on, like, I was not, you know, spending nearly enough time on my bike, for example. Like after that, I was on my bike six hours a day on the weekends. That's a lot of time to spend in the saddle. Just like, just kind of, you know, and that was like, you know, for a couple of months leading up to it, he was right on, on certain aspects of it. And, but I kind of had to look internally and go, okay, like, what is he kind of missing about who I am and like, what I kind of know I'm capable of at this point. I mean, it was a nail biter. I mean, going into the thing, you know, it's, you get in, this is the same thing with launching bolt. It's like, or, or launching anything you get launch day, race day, you kind of go in, you're like, all right, here we go. Like we're going to, we're going to find out, we're going to find out, you know, how based in reality I was about all the decisions that led to this moment. And so I was going and doing the Ironman in like six months. Most people spend, you know, the, the folks he trains, usually it's, you know, one to two years on this stuff before you do try and do a full, you know, it's like going and kind of doing in that sort of timeframe. It's, it's, it's very similar to the same sort of skill set of going and building products. You have to really kind of look at the base reality and go make your own assessment onAlessio [01:36:24]: it. Right. So cool. Great. Sorry to wrap. Thank you so much here. Thanks for your time. Get full access to Latent Space at www.latent.space/subscribe
    --------  
    1:38:39
  • The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic
    We have announced our first speaker, friend of the show Dylan Patel, and topic slates for Latent Space LIVE! at NeurIPS. Sign up for IRL/Livestream and to debate!We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!The vibe shift we observed in July - in favor of Claude 3.5 Sonnet, first introduced in June — has been remarkably long lived and persistent, surviving multiple subsequent updates of 4o, o1 and Gemini versions, for Anthropic’s Claude to end 2024 as the preferred model for AI Engineers and even being the exclusive choice for new code agents like bolt.new (our next guest on the pod!), which unlocked so much performance from Claude Sonnet that it went from $0 to $4m ARR in 4 weeks when it launched last month.Anthropic has now raised an additional $4b from Amazon and made an incredibly well received update of Claude 3.5 Sonnet (and Haiku), making significant improvements in performance over its predecessors:Solving SWE-BenchAs part of the October Sonnet release, Anthropic teased a blink-and-you’ll miss it result:The updated Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with particularly strong gains in agentic coding and tool use tasks. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding. It also improves performance on TAU-bench, an agentic tool use task, from 62.6% to 69.2% in the retail domain, and from 36.0% to 46.0% in the more challenging airline domain. The new Claude 3.5 Sonnet offers these advancements at the same price and speed as its predecessor.This was followed up by a blogpost a week later from today’s guest, Erik Schluntz, the engineer who implemented and scored this SOTA result using a simple, non-overengineered version of the SWE-Agent framework (you can see the submissions here). We have previously covered the SWE-Bench story extensively:* Speaking with SWEBench/SWEAgent authors at ICLR* Speaking with Cosine Genie, the previous SOTA (43.8%) on SWEBench Verified (with brief update at DevDay 2024)* Speaking with Shunyu Yao on SWEBench and the ReAct paradigm driving SWE-AgentOne of the notable inclusions in this blogpost are the tools that Erik decided to give Claude, e.g. the “Edit Tool”:The tools teased in the SWEBench submission/blogpost were then polished up and released with Computer Use…And you can also see even more computer use tools given in the new Model Context Protocol servers:Claude Computer UseBecause it is one of the best received AI releases of the year, we recommend watching the 2 minute Computer Use intro (and related demos) in its entirety:Eric also worked on Claude’s function calling, tool use, and computer use APIs, so we discuss that in the episode.Erik [00:53:39]: With computer use, just give the thing a browser that's logged into what you want to integrate with, and it's going to work immediately. And I see that reduction in friction as being incredibly exciting. Imagine a customer support team where, okay, hey, you got this customer support bot, but you need to go integrate it with all these things. And you don't have any engineers on your customer support team. But if you can just give the thing a browser that's logged into your systems that you need it to have access to, now, suddenly, in one day, you could be up and rolling with a fully integrated customer service bot that could go do all the actions you care about. So I think that's the most exciting thing for me about computer use, is reducing that friction of integrations to almost zero.As you’ll see, this is very top of mind for Erik as a former Robotics founder who’s company basically used robots to interface with human physical systems like elevators.Full Video episodePlease like and subscribe!Show Notes* Eric Schluntz* “Raising the bar on SWE-Bench Verified”* Cobalt Robotics* SWE-Bench* SWE-Bench Verified* Human Eval & other benchmarks* Anthropic Workbench* Aider* Cursor* Fireworks AI* E2B* Amanda Askell* Toyota Research* Physical Intelligence (Pi)* Chelsea Finn* Josh Albrecht* Eric Jang* 1X* Dust* Cosine Episode* Bolt* Adept Episode* TauBench* LMSys EpisodeTimestamps* [00:00:00] Introductions* [00:03:39] What is SWE-Bench?* [00:12:22] SWE-Bench vs HumanEval vs others* [00:15:21] SWE-Agent architecture and runtime* [00:21:18] Do you need code indexing?* [00:24:50] Giving the agent tools* [00:27:47] Sandboxing for coding agents* [00:29:16] Why not write tests?* [00:30:31] Redesigning engineering tools for LLMs* [00:35:53] Multi-agent systems* [00:37:52] Why XML so good?* [00:42:57] Thoughts on agent frameworks* [00:45:12] How many turns can an agent do?* [00:47:12] Using multiple model types* [00:51:40] Computer use and agent use cases* [00:59:04] State of AI robotics* [01:04:24] Robotics in manufacturing* [01:05:01] Hardware challenges in robotics* [01:09:21] Is self-driving a good business?TranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner and CTO at Decibel Partners. And today we're in the new studio with my usual co-host, Shawn from Smol AI.Swyx [00:00:14]: Hey, and today we're very blessed to have Erik Schluntz from Anthropic with us. Welcome.Erik [00:00:19]: Hi, thanks very much. I'm Erik Schluntz. I'm a member of technical staff at Anthropic, working on tool use, computer use, and Swebench.Swyx [00:00:27]: Yeah. Well, how did you get into just the whole AI journey? I think you spent some time at SpaceX as well? Yeah. And robotics. Yeah. There's a lot of overlap between like the robotics people and the AI people, and maybe like there's some interlap or interest between language models for robots right now. Maybe just a little bit of background on how you got to where you are. Yeah, sure.Erik [00:00:50]: I was at SpaceX a long time ago, but before joining Anthropic, I was the CTO and co-founder of Cobalt Robotics. We built security and inspection robots. These are sort of five foot tall robots that would patrol through an office building or a warehouse looking for anything out of the ordinary. Very friendly, no tasers or anything. We would just sort of call a remote operator if we saw anything. We have about 100 of those out in the world, and had a team of about 100. We actually got acquired about six months ago, but I had left Cobalt about a year ago now, because I was starting to get a lot more excited about AI. I had been writing a lot of my code with things like Copilot, and I was like, wow, this is actually really cool. If you had told me 10 years ago that AI would be writing a lot of my code, I would say, hey, I think that's AGI. And so I kind of realized that we had passed this level, like, wow, this is actually really useful for engineering work. That got me a lot more excited about AI and learning about large language models. So I ended up taking a sabbatical and then doing a lot of reading and research myself and decided, hey, I want to go be at the core of this and joined Anthropic.Alessio [00:01:53]: And why Anthropic? Did you consider other labs? Did you consider maybe some of the robotics companies?Erik [00:02:00]: So I think at the time I was a little burnt out of robotics, and so also for the rest of this, any sort of negative things I say about robotics or hardware is coming from a place of burnout, and I reserve my right to change my opinion in a few years. Yeah, I looked around, but ultimately I knew a lot of people that I really trusted and I thought were incredibly smart at Anthropic, and I think that was the big deciding factor to come there. I was like, hey, this team's amazing. They're not just brilliant, but sort of like the most nice and kind people that I know, and so I just felt like I could be a really good culture fit. And ultimately, I do care a lot about AI safety and making sure that I don't want to build something that's used for bad purposes, and I felt like the best chance of that was joining Anthropic.Alessio [00:02:39]: And from the outside, these labs kind of look like huge organizations that have these obscureSwyx [00:02:44]: ways to organize.Alessio [00:02:45]: How did you get, you joined Anthropic, did you already know you were going to work on of the stuff you publish or you kind of join and then you figure out where you land? I think people are always curious to learn more.Erik [00:02:57]: Yeah, I've been very happy that Anthropic is very bottoms up and sort of very sort of receptive to whatever your interests are. And so I joined sort of being very transparent of like, hey, I'm most excited about code generation and AI that can actually go out and sort of touch the world or sort of help people build things. And, you know, those weren't my initial initial projects. I also came in and said, hey, I want to do the most valuable possible thing for this company and help Anthropic succeed. And, you know, like, let me find the balance of those. So I was working on lots of things at the beginning, you know, function calling, tool use. And then sort of as it became more and more relevant, I was like, oh, hey, like, let's it's time to go work on encoding agents and sort of started looking at SWE-Bench as sort of a really good benchmark for that.Swyx [00:03:39]: So let's get right into SWE-Bench. That's one of the many claims to fame. I feel like there's just been a series of releases related with Cloud 3.5 Sonnet around about two or three months ago, 3.5 Sonnet came out and it was it was a step ahead in terms of a lot of people immediately fell in love with it for coding. And then last month you released a new updated version of Cloud Sonnet. We're not going to talk about the training for that because that's still confidential. But I think Anthropic's done a really good job, like applying the model to different things. So you took the lead on SWE-Bench, but then also we're going to talk a little bit about computer use later on. So maybe just give us a context about why you looked at SWE-Bench Verified and you actually came up with a whole system for building agents that would maximally use the model well. Yeah.Erik [00:04:28]: So I'm on a sub team called Product Research. And basically the idea of product research is to really understand what end customers care about and want in the models and then work to try to make that happen. So we're not focused on sort of these more abstract general benchmarks like math problems or MMLU, but we really care about finding the things that are really valuable and making sure the models are great at those. And so because I've been interested in coding agents, I knew that this would be a really valuable thing. And I knew there were a lot of startups and our customers trying to build coding agents with our models. And so I said, hey, this is going to be a really good benchmark to be able to measure that and do well on it. And I wasn't the first person at Anthropic to find SWE-Bench, and there are lots of people that already knew about it and had done some internal efforts on it. It fell to me to sort of both implement the benchmark, which is very tricky, and then also to sort of make sure we had an agent and basically like a reference agent, maybe I'd call it, that could do very well on it. Ultimately, we want to provide how we implemented that reference agent so that people can build their own agents on top of our system and get sort of the most out of it as possible. So with this blog post we released on SWE-Bench, we released the exact tools and the prompt that we gave the model to be able to do well.Swyx [00:05:46]: For people who don't know, who maybe haven't dived into SWE-Bench, I think the general perception is they're like tasks that a software engineer could do. I feel like that's an inaccurate description because it is basically, one, it's a subset of like 12 repos. It's everything they could find that every issue with like a matching commit that could be tested. So that's not every commit. And then SWE-Bench verified is further manually filtered by OpenAI. Is that an accurate description and anything you'd change about that? Yes.Erik [00:06:14]: SWE-Bench is, it certainly is a subset of all tasks. It's first of all, it's only Python repos, so already fairly limited there. And it's just 12 of these popular open source repos. And yes, it's only ones where there were tests that passed at the beginning and also new tests that were introduced that test the new feature that's added. So it is, I think, a very limited subset of real engineering tasks. But I think it's also very valuable because even though it's a subset, it is true engineering tasks. And I think a lot of other benchmarks are really kind of these much more artificial setups of even if they're related to coding, they're more like coding interview style questions or puzzles that I think are very different from day-to-day what you end up doing. I don't know how frequently you all get to use recursion in your day-to-day job, but whenever I do, it's like a treat. And I think it's almost comical, and a lot of people joke about this in the industry, is how different interview questions are.Swyx [00:07:13]: Dynamic programming. Yeah, exactly.Erik [00:07:15]: Like, you code. From the day-to-day job. But I think one of the most interesting things about SWE-Bench is that all these other benchmarks are usually just isolated puzzles, and you're starting from scratch. Whereas SWE-Bench, you're starting in the context of an entire repository. And so it adds this entirely new dimension to the problem of finding the relevant files. And this is a huge part of real engineering, is it's actually pretty rare that you're starting something totally greenfield. You need to go and figure out where in a codebase you're going to make a change and understand how your work is going to interact with the rest of the systems. And I think SWE-Bench does a really good job of presenting that problem.Alessio [00:07:51]: Why do we still use human eval? It's like 92%, I think. I don't even know if you can actually get to 100% because some of the data is not actuallySwyx [00:07:59]: solvable.Alessio [00:08:00]: Do you see benchmarks like that, they should just get sunsetted? Because when you look at the model releases, it's like, oh, it's like 92% instead of like 89%, 90% on human eval versus, you know, SWE-Bench verified is you have 49%, right? Which is like, before 45% was state of the art, but maybe like six months ago it was like 30%, something like that. So is that a benchmark that you think is going to replace human eval, or do you think they're just going to run in parallel?Erik [00:08:27]: I think there's still need for sort of many different varied evals. Like sometimes you do really care about just sort of greenfield code generation. And so I don't think that everything needs to go to sort of an agentic setup.Swyx [00:08:39]: It would be very expensive to implement.Erik [00:08:41]: The other thing I was going to say is that SWE-Bench is certainly hard to implement and expensive to run because each task, you have to parse, you know, a lot of the repo to understand where to put your code. And a lot of times you take many tries of writing code, running it, editing it. It can use a lot of tokens compared to something like human eval. So I think there's definitely a space for these more traditional coding evals that are sort of easy to implement, quick to run, and do get you some signal. Maybe hopefully there's just sort of harder versions of human eval that get created.Alessio [00:09:14]: How do we get SWE-Bench verified to 92%? Do you think that's something where it's like line of sight to it, or it's like, you know, we need a whole lot of things to go right? Yeah, yeah.Erik [00:09:23]: And actually, maybe I'll start with SWE-Bench versus SWE-Bench verified, which is I think something I missed earlier. So SWE-Bench is, as we described, this big set of tasks that were scraped.Swyx [00:09:33]: Like 12,000 or something?Erik [00:09:34]: Yeah, I think it's 2,000 in the final set. But a lot of those, even though a human did them, they're actually impossible given the information that comes with the task. The most classic example of this is the test looks for a very specific error string. You know, like assert message equals error, something, something, something. And unless you know that's exactly what you're looking for, there's no way the model is going to write that exact same error message, and so the tests are going to fail. So SWE-Bench verified was actually made in partnership with OpenAI, and they hired humans to go review all these tasks and pick out a subset to try to remove any obstacle like this that would make the tasks impossible. So in theory, all of these tasks should be fully doable by the model. And they also had humans grade how difficult they thought the problems would be. Between less than 15 minutes, I think 15 minutes to an hour, an hour to four hours, and greater than four hours. So that's kind of this interesting sort of how big the problem is as well. To get to SWE-Bench verified to 90%, actually, maybe I'll also start off with some of the remaining failures that I see when running our model on SWE-Bench. I'd say the biggest cases are the model sort of operates at the wrong level of abstraction. And what I mean by that is the model puts in maybe a smaller band-aid when really the task is asking for a bigger refactor. And some of those, you know, is the model's fault, but a lot of times if you're just sort of seeing the GitHub issue, it's not exactly clear which way you should do. So even though these tasks are possible, there's still some ambiguity in how the tasks are described. That being said, I think in general, language models frequently will produce a smaller diff when possible, rather than trying to do a big refactor. I think another area, at least the agent we created, didn't have any multimodal abilities, even though our models are very good at vision. So I think that's just a missed opportunity. And if I read through some of the traces, there's some funny things where, especially the tasks on matplotlib, which is a graphing library, the test script will save an image and the model will just say, okay, it looks great, you know, without looking at it. So there's certainly extra juice to squeeze there of just making sure the model really understands all the sides of the input that it's given, including multimodal. But yeah, I think like getting to 92%. So this is something that I have not looked at, but I'm very curious about. I want someone to look at, like, what is the union of all of the different tasks that have been solved by at least one attempt at SWE-Bench Verified. There's a ton of submissions to the benchmark, and so I'd be really curious to see how many of those 500 tasks at least someone has solved. And I think, you know, there's probably a bunch that none of the attempts have ever solved. And I think it'd be interesting to look at those and say, hey, is there some problem with these? Like, are these impossible? Or are they just really hard and only a human could do them?Swyx [00:12:22]: Yeah, like specifically, is there a category of problems that are still unreachable by any LLM agent? Yeah, yeah. And I think there definitely are.Erik [00:12:28]: The question is, are those fairly inaccessible or are they just impossible because of the descriptions? But I think certainly some of the tasks, especially the ones that the human graders reviewed as like taking longer than four hours are extremely difficult. I think we got a few of them right, but not very many at all in the benchmark.Swyx [00:12:49]: And did those take less than four hours?Erik [00:12:51]: They certainly did less than, yeah, than four hours.Swyx [00:12:54]: Is there a correlation of length of time with like human estimated time? You know what I mean? Or do we have sort of more of X paradox type situations where it's something super easy for a model, but hard for a human?Erik [00:13:06]: I actually haven't done the stats on that, but I think that'd be really interesting to see of like how many tokens does it take and how is that correlated with difficulty? What is the likelihood of success with difficulty? I think actually a really interesting thing that I saw, one of my coworkers who was also working on this named Simon, he was focusing just specifically on the very hard problems, the ones that are said to take longer than four hours. And he ended up sort of creating a much more detailed prompt than I used. And he got a higher score on the most difficult subset of problems, but a lower score overall on the whole benchmark. And the prompt that I made, which is sort of much more simple and bare bones, got a higher score on the overall benchmark, but lower score on the really hard problems. And I think some of that is the really detailed prompt made the model sort of overcomplicate a lot of the easy problems, because honestly, a lot of the suite bench problems, they really do just ask for a bandaid where it's like, hey, this crashes if this is none, and really all you need to do is put a check if none. And so sometimes trying to make the model think really deeply, it'll think in circles and overcomplicate something, which certainly human engineers are capable of as well. But I think there's some interesting thing of the best prompt for hard problems might not be the best prompt for easy problems.Alessio [00:14:19]: How do we fix that? Are you supposed to fix it at the model level? How do I know what prompt I'm supposed to use?Swyx [00:14:25]: Yeah.Erik [00:14:26]: And I'll say this was a very small effect size, and so I think this isn't worth obsessing over. I would say that as people are building systems around agents, I think the more you can separate out the different kinds of work the agent needs to do, the better you can tailor a prompt for that task. And I think that also creates a lot of like, for instance, if you were trying to make an agent that could both solve hard programming tasks, and it could just write quick test files for something that someone else had already made, the best way to do those two tasks might be very different prompts. I see a lot of people build systems where they first sort of have a classification, and then route the problem to two different prompts. And that's sort of a very effective thing, because one, it makes the two different prompts much simpler and smaller, and it means you can have someone work on one of the prompts without any risk of affecting the other tasks. So it creates like a nice separation of concerns. Yeah.Alessio [00:15:21]: And the other model behavior thing you mentioned, they prefer to generate like shorter diffs. Why is that? Like, is there a way? I think that's maybe like the lazy model question that people have is like, why are you not just generating the whole code instead of telling me to implement it?Swyx [00:15:36]: Are you saving tokens? Yeah, exactly. It's like conspiracy theory. Yeah. Yeah.Erik [00:15:41]: Yeah. So there's two different things there. One is like the, I'd say maybe like doing the easier solution rather than the hard solution. And I'd say the second one, I think what you're talking about is like the lazy model is like when the model says like dot, dot, dot, code remains the same.Swyx [00:15:52]: Code goes here. Yeah. I'm like, thanks, dude.Erik [00:15:55]: But honestly, like that just comes as like people on the internet will do stuff like that. And like, dude, if you're talking to a friend and you ask them like to give you some example code, they would definitely do that. They're not going to reroll the whole thing. And so I think that's just a matter of like, you know, sometimes you actually do just, just want like the relevant changes. And so I think it's, this is something where a lot of times like, you know, the models aren't good at mind reading of like which one you want. So I think that like the more explicit you can be in prompting to say, Hey, you know, give me the entire thing, no, no elisions versus just give me the relevant changes. And that's something, you know, we want to make the models always better at following those kinds of instructions.Swyx [00:16:32]: I'll drop a couple of references here. We're recording this like a day after Dario, Lex Friedman just dropped his five hour pod with Dario and Amanda and the rest of the crew. And Dario actually made this interesting observation that like, we actually don't want, we complain about models being too chatty in text and then not chatty enough in code. And so like getting that right is kind of a awkward bar because, you know, you, you don't want it to yap in its responses, but then you also want it to be complete in, in code. And then sometimes it's not complete. Sometimes you just want it to diff, which is something that Enthopic has also released with a, you know, like the, the fast edit stuff that you guys did. And then the other thing I wanted to also double back on is the prompting stuff. You said, you said it was a small effect, but it was a noticeable effect in terms of like picking a prompt. I think we'll go into suite agent in a little bit, but I kind of reject the fact that, you know, you need to choose one prompt and like have your whole performance be predicated on that one prompt. I think something that Enthopic has done really well is meta prompting, prompting for a prompt. And so why can't you just develop a meta prompt for, for all the other prompts? And you know, if it's a simple task, make a simple prompt, if it's a hard task, make a hard prompt. Obviously I'm probably hand-waving a little bit, but I will definitely ask people to try the Enthopic Workbench meta prompting system if they haven't tried it yet. I went to the Build Day recently at Enthopic HQ, and it's the closest I've felt to an AGI, like learning how to operate itself that, yeah, it's, it's, it's really magical.Erik [00:17:57]: Yeah, no, Claude is great at writing prompts for Claude.Swyx [00:18:00]: Right, so meta prompting. Yeah, yeah.Erik [00:18:02]: The way I think about this is that humans, even like very smart humans still use sort of checklists and use sort of scaffolding for themselves. Surgeons will still have checklists, even though they're incredible experts. And certainly, you know, a very senior engineer needs less structure than a junior engineer, but there still is some of that structure that you want to keep. And so I always try to anthropomorphize the models and try to think about for a human sort of what is the equivalent. And that's sort of, you know, how I think about these things is how much instruction would you give a human with the same task? And do you, would you need to give them a lot of instruction or a little bit of instruction?Alessio [00:18:36]: Let's talk about the agent architecture maybe. So first, runtime, you let it run until it thinks it's done or it reaches 200k context window.Swyx [00:18:45]: How did you come up? What's up with that?Erik [00:18:47]: Yeah.Swyx [00:18:48]: Yeah.Erik [00:18:49]: I mean, this, so I'd say that a lot of previous agent work built sort of these very hard coded and rigid workflows where the model is sort of pushed through certain flows of steps. And I think to some extent, you know, that's needed with smaller models and models that are less smart. But one of the things that we really wanted to explore was like, let's really give Claude the reins here and not force Claude to do anything, but let Claude decide, you know, how it should approach the problem, what steps it should do. And so really, you know, what we did is like the most extreme version of this is just give it some tools that it can call and it's able to keep calling the tools, keep thinking, and then yeah, keep doing that until it thinks it's done. And that's sort of the most, the most minimal agent framework that we came up with. And I think that works very well. I think especially the new Sonnet 3.5 is very, very good at self-correction, has a lot of like grit. Claude will try things that fail and then try, you know, come back and sort of try different approaches. And I think that's something that you didn't see in a lot of previous models. Some of the existing agent frameworks that I looked at, they had whole systems built to try to detect loops and see, oh, is the model doing the same thing, you know, more than three times, then we have to pull it out. And I think like the smarter the models are, the less you need that kind of extra scaffolding. So yeah, just giving the model tools and letting it keep sample and call tools until it thinks it's done was the most minimal framework that we could think of. And so that's what we did.Alessio [00:20:18]: So you're not pruning like bad paths from the context. If it tries to do something, it fails. You just burn all these tokens.Swyx [00:20:25]: Yes.Erik [00:20:26]: I would say the downside of this is that this is sort of a very token expensive way to doSwyx [00:20:29]: this. But still, it's very common to prune bad paths because models get stuck. Yeah.Erik [00:20:35]: But I'd say that, yeah, 3.5 is not getting stuck as much as previous models. And so, yeah, we wanted to at least just try the most minimal thing. Now, I would say that, you know, this is definitely an area of future research, especially if we talk about these problems that are going to take a human more than four hours. Those might be things where we're going to need to go prune bad paths to let the model be able to accomplish this task within 200k tokens. So certainly I think there's like future research to be done in that area, but it's not necessary to do well on these benchmarks.Swyx [00:21:06]: Another thing I always have questions about on context window things, there's a mini cottage industry of code indexers that have sprung up for large code bases, like the ones in SweetBench. You didn't need them? We didn't.Erik [00:21:18]: And I think I'd say there's like two reasons for this. One is like SweetBench specific and the other is a more general thing. The more general thing is that I think Sonnet is very good at what we call agentic search. And what this basically means is letting the model decide how to search for something. It gets the results and then it can decide, should it keep searching or is it done? Does it have everything it needs? So if you read through a lot of the traces of the SweetBench, the model is calling tools to view directories, list out things, view files. And it will do a few of those until it feels like it's found the file where the bug is. And then it will start working on that file. And I think like, again, this is all, everything we did was about just giving Claude the full reins. So there's no hard-coded system. There's no search system that you're relying on getting the correct files into context. This just totally lets Claude do it.Swyx [00:22:11]: Or embedding things into a vector database. Exactly. Oops. No, no.Erik [00:22:17]: This is very, very token expensive. And so certainly, and it also takes many, many turns. And so certainly if you want to do something in a single turn, you need to do RAG and just push stuff into the first prompt.Alessio [00:22:28]: And just to make it clear, it's using the Bash tool, basically doing LS, looking at files and then doing CAD for the following context. It can do that.Erik [00:22:35]: But it's file editing tool also has a command in it called view that can view a directory. It's very similar to LS, but it just sort of has some nice sort of quality of life improvements. So I think it'll only do an LS sort of two directories deep so that the model doesn't get overwhelmed if it does this on a huge file. I would say actually we did more engineering of the tools than the overall prompt. But the one other thing I want to say about this agentic search is that for SWE-Bench specifically, a lot of the tasks are bug reports, which means they have a stack trace in them. And that means right in that first prompt, it tells you where to go. And so I think this is a very easy case for the model to find the right files versus if you're using this as a general coding assistant where there isn't a stack trace or you're asking it to insert a new feature, I think there it's much harder to know which files to look at. And that might be an area where you would need to do more of this exhaustive search where an agentic search would take way too long.Swyx [00:23:33]: As someone who spent the last few years in the JS world, it'd be interesting to see SWE-Bench JS because these stack traces are useless because of so much virtualization that we do. So they're very, very disconnected with where the code problems are actually appearing.Erik [00:23:50]: That makes me feel better about my limited front-end experience, as I've always struggled with that problem.Swyx [00:23:55]: It's not your fault. We've gotten ourselves into a very, very complicated situation. And I'm not sure it's entirely needed. But if you talk to our friends at Vercel, they will say it is.Erik [00:24:04]: I will say SWE-Bench just released SWE-Bench Multimodal, which I believe is either entirely JavaScript or largely JavaScript. And it's entirely things that have visual components of them.Swyx [00:24:15]: Are you going to tackle that? We will see.Erik [00:24:17]: I think it's on the list and there's interest, but no guarantees yet.Swyx [00:24:20]: Just as a side note, it occurs to me that every model lab, including Enthopic, but the others as well, you should have your own SWE-Bench, whatever your bug tracker tool. This is a general methodology that you can use to track progress, I guess.Erik [00:24:34]: Yeah, sort of running on our own internal code base.Swyx [00:24:36]: Yeah, that's a fun idea.Alessio [00:24:37]: Since you spend so much time on the tool design, so you have this edit tool that can make changes and whatnot. Any learnings from that that you wish the AI IDEs would take in? Is there some special way to look at files, feed them in?Erik [00:24:50]: I would say the core of that tool is string replace. And so we did a few different experiments with different ways to specify how to edit a file. And string replace, basically, the model has to write out the existing version of the string and then a new version, and that just gets swapped in. We found that to be the most reliable way to do these edits. Other things that we tried were having the model directly write a diff, having the model fully regenerate files. That one is actually the most accurate, but it takes so many tokens, and if you're in a very big file, it's cost prohibitive. There's basically a lot of different ways to represent the same task. And they actually have pretty big differences in terms of model accuracy. I think Eider, they have a really good blog where they explore some of these different methods for editing files, and they post results about them, which I think is interesting. But I think this is a really good example of the broader idea that you need to iterate on tools rather than just a prompt. And I think a lot of people, when they make tools for an LLM, they kind of treat it like they're just writing an API for a computer, and it's sort of very minimal. It's sort of just the bare bones of what you'd need, and honestly, it's so hard for the models to use those. Again, I come back to anthropomorphizing these models. Imagine you're a developer, and you just read this for the very first time, and you're trying to use it. You can do so much better than just sort of the bare API spec of what you'd often see. Include examples in the description. Include really detailed explanations of how things work. And I think that, again, also think about what is the easiest way for the model to represent the change that it wants to make. For file editing, as an example, writing a diff is actually... Let's take the most extreme example. You want the model to literally write a patch file. I think patch files have at the very beginning numbers of how many total lines change. That means before the model has actually written the edit, it needs to decide how many numbers or how many lines are going to change.Swyx [00:26:52]: Don't quote me on that.Erik [00:26:54]: I think it's something like that, but I don't know if that's exactly the diff format. But you can certainly have formats that are much easier to express without messing up than others. And I like to think about how much human effort goes into designing human interfaces for things. It's incredible. This is entirely what FrontEnd is about, is creating better interfaces to kind of do the same things. And I think that same amount of attention and effort needs to go into creating agent computer interfaces.Swyx [00:27:19]: It's a topic we've discussed, ACI or whatever that looks like. I would also shout out that I think you released some of these toolings as part of computer use as well. And people really liked it. It's all open source if people want to check it out. I'm curious if there's an environment element that complements the tools. So how do you... Do you have a sandbox? Is it just Docker? Because that can be slow or resource intensive. Do you have anything else that you would recommend?Erik [00:27:47]: I don't think I can talk about sort of public details or about private details about how we implement our sandboxing. But obviously, we need to have sort of safe, secure, and fast sandboxes for training for the models to be able to practice writing code and working in an environment.Swyx [00:28:03]: I'm aware of a few startups working on agent sandboxing. E2B is a close friend of ours that Alessio has led around in, but also I think there's others where they're focusing on snapshotting memory so that it can do time travel for debugging. Computer use where you can control the mouse or keyboard or something like that. Whereas here, I think that the kinds of tools that we offer are very, very limited to coding agent work cases like bash, edit, you know, stuff like that. Yeah.Erik [00:28:30]: I think the computer use demo that we released is an extension of that. It has the same bash and edit tools, but it also has the computer tool that lets it get screenshots and move the mouse and keyboard. Yeah. So I definitely think there's sort of more general tools there. And again, the tools we released as part of SweetBench were, I'd say they're very specific for like editing files and doing bash, but at the same time, that's actually very general if you think about it. Like anything that you would do on a command line or like editing files, you can do with those tools. And so we do want those tools to feel like any sort of computer terminal work could be done with those same tools rather than making tools that were like very specific for SweetBench like run tests as its own tool, for instance. Yeah.Swyx [00:29:15]: You had a question about tests.Alessio [00:29:16]: Yeah, exactly. I saw there's no test writer tool. Is it because it generates the code and then you're running it against SweetBench anyway, so it doesn't really need to write the test or?Swyx [00:29:26]: Yeah.Erik [00:29:27]: So this is one of the interesting things about SweetBench is that the tests that the model's output is graded on are hidden from it. That's basically so that the model can't cheat by looking at the tests and writing the exact solution. And I'd say typically the model, the first thing it does is it usually writes a little script to reproduce the error. And again, most SweetBench tasks are like, hey, here's a bug that I found. I run this and I get this error. So the first thing the model does is try to reproduce that. So it's kind of been rerunning that script as a mini test. But yeah, sometimes the model will like accidentally introduce a bug that breaks some other tests and it doesn't know about that.Alessio [00:30:05]: And should we be redesigning any tools? We kind of talked about this and like having more examples, but I'm thinking even things of like Q as a query parameter in many APIs, it's like easier for the model to like re-query than read the Q. I'm sure it learned the Q by this point, but like, is there anything you've seen like building this where it's like, hey, if I were to redesign some CLI tools, some API tool, I would like change the way structure to make it better for LLMs?Erik [00:30:31]: I don't think I've thought enough about that off the top of my head, but certainly like just making everything more human friendly, like having like more detailed documentation and examples. I think examples are really good in things like descriptions, like so many, like just using the Linux command line, like how many times I do like dash dash help or look at the man page or something. It's like, just give me one example of like how I actually use this. Like I don't want to go read through a hundred flags. Just give me the most common example. But again, so you know, things that would be useful for a human, I think are also very useful for a model.Swyx [00:31:03]: Yeah. I mean, there's one thing that you cannot give to code agents that is useful for human is this access to the internet. I wonder how to design that in, because one of the issues that I also had with just the idea of a suite bench is that you can't do follow up questions. You can't like look around for similar implementations. These are all things that I do when I try to fix code and we don't do that. It's not, it wouldn't be fair, like it'd be too easy to cheat, but then also it's kind of not being fair to these agents because they're not operating in a real world situation. Like if I had a real world agent, of course I'm giving it access to the internet because I'm not trying to pass a benchmark. I don't have a question in there more, more just like, I feel like the most obvious tool access to the internet is not being used.Erik [00:31:47]: I think that that's really important for humans, but honestly the models have so much general knowledge from pre-training that it's, it's like less important for them. I feel like versioning, you know, if you're working on a newer thing that was like, they came after the knowledge cutoff, then yes, I think that's very important. I think actually this, this is like a broader problem that there is a divergence between Sweebench and like what customers will actually care about who are working on a coding agent for real use. And I think one of those there is like internet access and being able to like, how do you pull in outside information? I think another one is like, if you have a real coding agent, you don't want to have it start on a task and like spin its wheels for hours because you gave it a bad prompt. You want it to come back immediately and ask follow up questions and like really make sure it has a very detailed understanding of what to do, then go off for a few hours and do work. So I think that like real tasks are going to be much more interactive with the agent rather than this kind of like one shot system. And right now there's no benchmark that, that measures that. And maybe I think it'd be interesting to have some benchmark that is more interactive. I don't know if you're familiar with TauBench, but it's a, it's a customer service benchmark where there's basically one LLM that's playing the user or the customer that's getting support and another LLM that's playing the support agent and they interact and try to resolve the issue.Swyx [00:33:08]: Yeah. We talked to the LMSIS guys. Awesome. And they also did MTBench for people listening along. So maybe we need MTSWE-Bench. Sure. Yeah.Erik [00:33:16]: So maybe, you know, you could have something where like before the SWE-Bench task starts, you have like a few back and forths with kind of like the, the author who can answer follow up questions about what they want the task to do. And of course you'd need to do that where it doesn't cheat and like just get the exact, the exact thing out of the human or out of the sort of user. But I think that would be a really interesting thing to see. If you look at sort of existing agent work, like a Repl.it's coding agent, I think one of the really great UX things they do is like first having the agent create a plan and then having the human approve that plan or give feedback. I think for agents in general, like having a planning step at the beginning, one, just having that plan will improve performance on the downstream task just because it's kind of like a bigger chain of thought, but also it's just such a better UX. It's way easier for a human to iterate on a plan with a model rather than iterating on the full task that sort of has a much slower time through each loop. If the human has approved this implementation plan, I think it makes the end result a lot more sort of auditable and trustable. So I think there's a lot of things sort of outside of SweetBench that will be very important for real agent usage in the world. Yeah.Swyx [00:34:27]: I will say also, there's a couple of comments on names that you dropped. Copilot also does the plan stage before it writes code. I feel like those approaches have generally been less Twitter successful because it's not prompt to code, it's prompt plan code. You know, so there's a little bit of friction in there, but it's not much. Like it's, it actually, it's, it, you get a lot for what it's worth. I also like the way that Devin does it, where you can sort of edit the plan as it goes along. And then the other thing with Repl.it, we had a, we hosted a sort of dev day pregame with Repl.it and they also commented about multi-agents. So like having two agents kind of bounce off of each other. I think it's a similar approach to what you're talking about with kind of the few shot example, just as in the prompts of clarifying what the agent wants. But typically I think this would be implemented as a tool calling another agent, like a sub-agent I don't know if you explored that, do you like that idea?Erik [00:35:20]: I haven't explored this enough, but I've definitely heard of people having good success with this. Of almost like basically having a few different sort of personas of agents, even if they're all the same LLM. I think this is one thing with multi-agent that a lot of people will kind of get confused by is they think it has to be different models behind each thing. But really it's sort of usually the same, the same model with different prompts. And yet having one, having them have different personas to kind of bring different sort of thoughts and priorities to the table. I've seen that work very well and sort of create a much more thorough and thought outSwyx [00:35:53]: response.Erik [00:35:53]: I think the downside is just that it adds a lot of complexity and it adds a lot of extra tokens. So I think it depends what you care about. If you want a plan that's very thorough and detailed, I think it's great. If you want a really quick, just like write this function, you know, you probably don't want to do that and have like a bunch of different calls before it does this.Alessio [00:36:11]: And just talking about the prompt, why are XML tags so good in Cloud? I think initially people were like, oh, maybe you're just getting lucky with XML. But I saw obviously you use them in your own agent prompts, so they must work. And why is it so model specific to your family?Erik [00:36:26]: Yeah, I think that there's, again, I'm not sure how much I can say, but I think there's historical reasons that internally we've preferred XML. I think also the one broader thing I'll say is that if you look at certain kinds of outputs, there is overhead to outputting in JSON. If you're trying to output code in JSON, there's a lot of extra escaping that needs to be done, and that actually hurts model performance across the board. Versus if you're in just a single XML tag, there's none of that sort of escaping thatSwyx [00:36:58]: needs to happen.Erik [00:36:58]: That being said, I haven't tried having it write HTML and XML, which maybe then you start running into weird escaping things there. I'm not sure. But yeah, I'd say that's some historical reasons, and there's less overhead of escaping.Swyx [00:37:12]: I use XML in other models as well, and it's just a really nice way to make sure that the thing that ends is tied to the thing that starts. That's the only way to do code fences where you're pretty sure example one start, example one end, that is one cohesive unit.Alessio [00:37:30]: Because the braces are nondescriptive. Yeah, exactly.Swyx [00:37:33]: That would be my simple reason. XML is good for everyone, not just Cloud. Cloud was just the first one to popularize it, I think.Erik [00:37:39]: I do definitely prefer to read XML than read JSON.Alessio [00:37:43]: Any other details that are maybe underappreciated? I know, for example, you had the absolute paths versus relative. Any other fun nuggets?Erik [00:37:52]: I think that's a good sort of anecdote to mention about iterating on tools. Like I said, spend time prompt engineering your tools, and don't just write the prompt, but write the tool, and then actually give it to the model and read a bunch of transcripts about how the model tries to use the tool. I think by doing that, you will find areas where the model misunderstands a tool or makes mistakes, and then basically change the tool to make it foolproof. There's this Japanese term, pokayoke, about making tools mistake-proof. You know, the classic idea is you can have a plug that can fit either way, and that's dangerous, or you can make it asymmetric so that it can't fit this way, it has to go like this, and that's a better tool because you can't use it the wrong way. So for this example of absolute paths, one of the things that we saw while testing these tools is, oh, if the model has done CD and moved to a different directory, it would often get confused when trying to use the tool because it's now in a different directory, and so the paths aren't lining up. So we said, oh, well, let's just force the tool to always require an absolute path, and then that's easy for the model to understand. It knows sort of where it is. It knows where the files are. And then once we have it always giving absolute paths, it never messes up even, like, no matter where it is because it just, if you're using an absolute path, it doesn't matter whereSwyx [00:39:13]: you are.Erik [00:39:13]: So iterations like that, you know, let us make the tool foolproof for the model. I'd say there's other categories of things where we see, oh, if the model, you know, opens vim, like, you know, it's never going to return. And so the tool is stuck.Swyx [00:39:28]: Did it get stuck? Yeah. Get out of vim. What?Erik [00:39:31]: Well, because the tool is, like, it just text in, text out. It's not interactive. So it's not like the model doesn't know how to get out of vim. It's that the way that the tool is, like, hooked up to the computer is not interactive. Yes, I mean, there is the meme of no one knows how to get out of vim. You know, basically, we just added instructions in the tool of, like, hey, don't launch commands that don't return.Swyx [00:39:54]: Yeah, like, don't launch vim.Erik [00:39:55]: Don't launch whatever. If you do need to do something, you know, put an ampersand after it to launch it in the background. And so, like, just, you know, putting kind of instructions like that just right in the description for the tool really helps the model. And I think, like, that's an underutilized space of prompt engineering, where, like, people might try to do that in the overall prompt, but just put that in the tool itself so the model knows that it's, like, for this tool, this is what's relevant.Swyx [00:40:20]: You said you worked on the function calling and tool use before you actually started this vBench work, right? Was there any surprises? Because you basically went from creator of that API to user of that API. Any surprises or changes you would make now that you have extensively dog-fooded in a state-of-the-art agent?Erik [00:40:39]: I want us to make, like, maybe, like, a little bit less verbose SDK. I think some way, like, right now, it just takes, I think we sort of force people to do the best practices of writing out sort of these full JSON schemas, but it would be really nice if you could just pass in a Python function as a tool. I think that could be something nice.Swyx [00:40:58]: I think that there's a lot of, like, Python- There's helper libraries. ... structure, you know. I don't know if there's anyone else that is specializing for Anthropic. Maybe Jeremy Howard's and Simon Willis's stuff. They all have Cloud-specific stuff that they are working on. Cloudette. Cloudette, exactly. I also wanted to spend a little bit of time with SuiteAgent. It seems like a very general framework. Like, is there a reason you picked it apart from it's the same authors as vBench, or?Erik [00:41:21]: The main thing we wanted to go with was the same authors as vBench, so it just felt sort of like the safest, most neutral option. And it was, you know, very high quality. It was very easy to modify, to work with. I would say it also actually, their underlying framework is sort of this, it's like, youSwyx [00:41:39]: know, think, act, observe.Erik [00:41:40]: That they kind of go through this loop, which is like a little bit more hard-coded than what we wanted to do, but it's still very close. That's still very general. So it felt like a good match as sort of the starting point for our agent. And we had already sort of worked with and talked with the SWE-Bench people directly, so it felt nice to just have, you know, we already know the authors. This will be easy to work with.Swyx [00:42:00]: I'll share a little bit of like, this all seems disconnected, but once you figure out the people and where they go to school, it all makes sense. So it's all Princeton. Yeah, the SWE-Bench and SuiteAgent.Erik [00:42:11]: It's a group out of Princeton.Swyx [00:42:12]: Yeah, and we had Shun Yu on the pod, and he came up with the React paradigm, and that's think, act, observe. That's all React. So they're all friends. Yep, yeah, exactly.Erik [00:42:22]: And you know, if you actually read our traces of our submission, you can actually see like think, act, observe in our logs. And we just didn't even change the printing code. So it's like doing still function calls under the hood, and the model can do sort of multiple function calls in a row without thinking in between if it wants to. But yeah, so a lot of similarities and a lot of things we inherited from SuiteAgent just as a starting point for the framework.Alessio [00:42:47]: Any thoughts about other agent frameworks? I think there's, you know, the whole gamut from very simple to like very complex.Swyx [00:42:53]: Autogen, CooEI, LandGraph. Yeah, yeah.Erik [00:42:56]: I think I haven't explored a lot of them in detail. I would say with agent frameworks in general, they can certainly save you some like boilerplate. But I think there's actually this like downside of making agents too easy, where you end up very quickly like building a much more complex system than you need. And suddenly, you know, instead of having one prompt, you have five agents that are talking to each other and doing a dialogue. And it's like, because the framework made that 10 lines to do, you end up building something that's way too complex. So I think I would actually caution people to like try to start without these frameworks if you can, because you'll be closer to the raw prompts and be able to sort of directly understand what's going on. I think a lot of times these frameworks also, by trying to make everything feel really magical, you end up sort of really hiding what the actual prompt and output of the model is, and that can make it much harder to debug. So certainly these things have a place, and I think they do really help at getting rid of boilerplate, but they come with this cost of obfuscating what's really happening and making it too easy to very quickly add a lot of complexity. So yeah, I would recommend people to like try it from scratch, and it's like not that bad.Alessio [00:44:08]: Would you rather have like a framework of tools? Do you almost see like, hey, it's maybe easier to get tools that are already well curated, like the ones that you build, if I had an easy way to get the best tool from you, andSwyx [00:44:21]: like you maintain the definition?Alessio [00:44:22]: Or yeah, any thoughts on how you want to formalize tool sharing?Erik [00:44:26]: Yeah, I think that's something that we're certainly interested in exploring, and I think there is space for sort of these general tools that will be very broadly applicable. But at the same time, most people that are building on these, they do have much more specific things that they're trying to do. You know, I think that might be useful for hobbyists and demos, but the ultimate end applications are going to be bespoke. And so we just want to make sure that the model's great at any tool that it uses. But certainly something we're exploring.Alessio [00:44:52]: So everything bespoke, no frameworks, no anything.Swyx [00:44:55]: Just for now, for now.Erik [00:44:56]: Yeah, I would say that like the best thing I've seen is people building up from like, build some good util functions, and then you can use those as building blocks. Yeah, yeah.Alessio [00:45:05]: I have a utils folder, or like all these scripts. My framework is like def, call, and tropic. And then I just put all the defaults.Swyx [00:45:12]: Yeah, exactly. There's a startup hidden in every utils folder, you know? No, totally not. Like, if you use it enough, like it's a startup, you know? At some point. I'm kind of curious, is there a maximum length of turns that it took? Like, what was the longest run? I actually don't.Erik [00:45:27]: I mean, it had basically infinite turns until it ran into a 200k context. I should have looked this up. I don't know. And so for some of those failed cases where it eventually ran out of context, I mean, it was over 100 turns. I'm trying to remember like the longest successful run, but I think it was definitely over 100 turns that some of the times.Swyx [00:45:48]: Which is not that much. It's a coffee break. Yeah.Erik [00:45:52]: But certainly, you know, these things can be a lot of turns. And I think that's because some of these things are really hard, where it's going to take, you know, many tries to do it. And if you think about like, think about a task that takes a human four hours to do. Think about how many different files you read, and like times you edit a file in four hours. That's a lot more than 100.Alessio [00:46:10]: How many times you open Twitter because you get distracted. But if you had a lot more compute, what's kind of like the return on the extra compute now? So like, you know, if you had thousands of turns or like whatever, like how much better would it get?Erik [00:46:23]: Yeah, this I don't know. And I think this is, I think sort of one of the open areas of research in general with agents is memory and sort of how do you have something that can do work beyond its context length where you're just purely appending. So you mentioned earlier things like pruning bad paths. I think there's a lot of interesting work around there. Can you just roll back but summarize, hey, don't go down this path? There be dragons. Yeah, I think that's very interesting that you could have something that that uses way more tokens without ever using at a time more than 200k. So I think that's very interesting. I think the biggest thing is like, can you make the model sort of losslessly summarize what it's learned from trying different approaches and bring things back? I think that's sort of the big challenge.Swyx [00:47:11]: What about different models?Alessio [00:47:12]: So you have Haiku, which is like, you know, cheaper. So you're like, well, what if I have a Haiku to do a lot of these smaller things and then put it back up?Erik [00:47:20]: I think Cursor might have said that they actually have a separate model for file editing.Swyx [00:47:25]: I'm trying to remember.Erik [00:47:25]: I think they were on maybe the Lex Fridman podcast where they said they have a bigger model, like write what the code should be and then a different model, like apply it. So I think there's a lot of interesting room for stuff like that. Yeah, fast supply.Swyx [00:47:37]: We actually did a pod with Fireworks that they worked with on. It's speculative decoding.Erik [00:47:41]: But I think there's also really interesting things about like, you know, paring down input tokens as well, especially sometimes the models trying to read like a 10,000 line file. That's a lot of tokens. And most of it is actually not going to be relevant. I think it'd be really interesting to like delegate that to Haiku. Haiku read this file and just pull out the most relevant functions. And then, you know, Sonnet reads just those and you save 90% on tokens. I think there's a lot of really interesting room for things like that. And again, we were just trying to do sort of the simplest, most minimal thing and show that it works. I'm really hoping that people, sort of the agent community builds things like that on top of our models. That's, again, why we released these tools. We're not going to go and do lots more submissions to SWE-Bench and try to prompt engineer this and build a bigger system. We want people to like the ecosystem to do that on top of our models. But yeah, so I think that's a really interesting one.Swyx [00:48:32]: It turns out, I think you did do 3.5 Haiku with your tools and it scored a 40.6. Yes.Erik [00:48:38]: So it did very well. It itself is actually very smart, which is great. But we haven't done any experiments with this combination of the two models. But yeah, I think that's one of the exciting things is that how well Haiku 3.5 did on SWE-Bench shows that sort of even our smallest, fastest model is very good at sort of thinking agentically and working on hard problems. Like it's not just sort of for writing simple text anymore.Alessio [00:49:02]: And I know you're not going to talk about it, but like Sonnet is not even supposed to be the best model, you know? Like Opus, it's kind of like we left it at three back in the corner intro. At some point, I'm sure the new Opus will come out. And if you had Opus Plus on it, that sounds very, very good.Swyx [00:49:19]: There's a run with SuiteAgent plus Opus, but that's the official SWE-Bench guys doing it.Erik [00:49:24]: That was the older, you know, 3.0.Swyx [00:49:25]: You didn't do yours. Yeah. Okay. Did you want to? I mean, you could just change the model name.Erik [00:49:31]: I think we didn't submit it, but I think we included it in our model card.Swyx [00:49:35]: Okay.Erik [00:49:35]: We included the score as a comparison. Yeah.Swyx [00:49:38]: Yeah.Erik [00:49:38]: And Sonnet and Haiku, actually, I think the new ones, they both outperformed the original Opus. Yeah. I did see that.Swyx [00:49:44]: Yeah. It's a little bit hard to find. Yeah.Erik [00:49:47]: It's not an exciting score, so we didn't feel like they need to submit it to the benchmark.Swyx [00:49:52]: We can cut over to computer use if we're okay with moving on to topics on this, if anything else. I think we're good.Erik [00:49:58]: I'm trying to think if there's anything else SWE-Bench related.Swyx [00:50:02]: It doesn't have to be also just specifically SWE-Bench, but just your thoughts on building agents, because you are one of the few people that have reached this leaderboard on building a coding agent. This is the state of the art. It's surprisingly not that hard to reach with some good principles. Right. There's obviously a ton of low-hanging fruit that we covered. Your thoughts on if you were to build a coding agent startup, what next?Erik [00:50:24]: I think the really interesting question for me, for all the startups out there, is this kind of divergence between the benchmarks and what real customers will want. So I'm curious, maybe the next time you have a coding agent startup on the podcast, you should ask them that. What are the differences that they're starting to make? Tomorrow.Swyx [00:50:40]: Oh, perfect, perfect. Yeah.Erik [00:50:41]: I'm actually very curious what they will see, because I also have seen, I feel like it's slowed down a little bit if I don't see the startups submitting to SWE-Bench that much anymore.Swyx [00:50:52]: Because of the traces, the trace. So we had Cosign on, they had a 50-something on full, on SWE-Bench full, which is the hardest one, and they were rejected because they didn't want to submit their traces. Yep. IP, you know? Yeah, that makes sense, that makes sense. Actually, tomorrow we're talking to Bolt, which is a cloud customer. You guys actually published a case study with them. I assume you weren't involved with that, but they were very happy with Cloud. Cool. One of the biggest launches of the year. Yeah, totally. We actually happened to be sitting in Adept's former office. My take on this is Anthropic shipped Adept as a feature. It's still a beta feature, but yes. What was it like when you tried it for the first time? Was it obvious that Cloud had reached that stage where you could do computer use? It was somewhat of a surprise to me.Erik [00:51:40]: I had been on vacation, and I came back, and everyone's like, computer use works. So it was this very exciting moment. After the first go to Google, I think I tried to have it play Minecraft or something, and it actually installed and opened Minecraft.Swyx [00:51:54]: I was like, wow, this is pretty cool.Erik [00:51:55]: So I was like, wow, yeah, this thing can actually use a computer. And certainly, it is still beta. There's certain things that it's not very good at yet. But I'm really excited, I think, most broadly, not just for new things that weren't possible before, but as a much lower friction way to implement tool use. One anecdote from my days at Cobalt Robotics, we wanted our robots to be able to ride elevators, to go between floors and fully cover a building. The first way that we did this was doing API integrations with the elevator companies. Some of them actually had APIs. We could send a request, and it would move the elevator. Each new company we did took six months to do,Swyx [00:52:37]: because they were very slow.Erik [00:52:39]: They didn't really care.Swyx [00:52:40]: Or an elevator, not an API.Erik [00:52:42]: Even installing, once we had it with the company, they would have to literally go install an API box on the elevator that we wanted to use, and that would sometimes take six months.Swyx [00:52:51]: So very slow.Erik [00:52:52]: And eventually, we're like, okay, this is slowing down all of our customer deployments. And I was like, what if we just add an arm to the robot? And I added this little arm that could literally go and press the elevator buttons, and we use computer vision to do this. And we could deploy that in a single day, and have the robot being able to use the elevators. At the same time, it was slower than the API. It wasn't quite as reliable. Sometimes it would miss, and it would have to try to press it again.Swyx [00:53:20]: But it would get there.Erik [00:53:20]: But it was slower and a little bit less reliable. And I kind of see this as an analogy to computer use, of anything you can do with computer use today, you could probably write tool use and integrate it with APIs.Swyx [00:53:33]: It's up to the language model.Erik [00:53:34]: But that's going to take a bunch of software engineering to write those integrations.Swyx [00:53:38]: You have to do all this stuff.Erik [00:53:39]: With computer use, just give the thing a browser that's logged into what you want to integrate with, and it's going to work immediately. And I see that reduction in friction as being incredibly exciting. Imagine a customer support team where, okay, hey, you got this customer support bot, but you need to go integrate it with all these things. And you don't have any engineers on your customer support team. But if you can just give the thing a browser that's logged into your systems that you need it to have access to, now, suddenly, in one day, you could be up and rolling with a fully integrated customer service bot that could go do all the actions you care about. So I think that's the most exciting thing for me about computer use, is reducing that friction of integrations to almost zero.Alessio [00:54:20]: Or farming on World of Warcraft.Swyx [00:54:23]: Yes, or that.Erik [00:54:23]: Just go computer use.Alessio [00:54:25]: Very high-value use cases.Swyx [00:54:27]: I always say about this, this is the oldest question in robotics or self-driving, which is, do you drive by vision or do you have special tools? And vision is the universal tool to claim all tools. There's trade-offs, but there's situations in which that will come. But this week's podcast, the one that we just put out, had Stan Polu from Dust saying that he doesn't see a future where it's the significant workhorse. I think there could be a separation between maybe the high-volume use cases. You want APIs. And then the long tail, you want computer use. I totally agree. Right?Erik [00:55:00]: Or you'll start, you'll prototype something with computer use. And then, hey, this is working. Customers have adopted this feature. OK, let's go turn it into an API. And it'll be faster and use less tokens.Swyx [00:55:11]: I'd be interested to see a computer use agent replace itself by figuring out the API and then just dropping out of the equation altogether.Erik [00:55:20]: Yeah, that's really fun, actually.Swyx [00:55:22]: If I was running an RPA company, you would have the RPA scripting. RPA, for people listening, is robotic process automation, where you would script things that always show up in sequence. So you don't have an LLM in the loop. And so basically what you need to do is train an LLM to code that script. And then you can naturally hand off from computer use to non-computer use.Erik [00:55:43]: Or have some way to turn Claude's actions of computer use into a saved script that you can then run repeatedly.Swyx [00:55:49]: Yeah, it'd be interesting to record that.Alessio [00:55:50]: Why did you decide to not ship any sandbox harness for computer use? It's kind of like, hey, peace.Swyx [00:55:58]: Run at your own risk. It's Docker, right?Erik [00:55:59]: No, no, we launched it with, I think, a VM or Docker, a Docker as system.Alessio [00:56:03]: But it's not for your actual computer, right? The Docker instance runs in the Docker. It's not for...Swyx [00:56:10]: Yeah, it runs its own browser.Erik [00:56:13]: I mean, the main reason for that, one, is sort of security. We don't want... The model can do anything. So we wanted to give it a sandbox, not have people do their own computer. At least sort of for our default experience. We really care about providing a nice sort of... Making the default safe, I think, is the best way for us to do it. And I mean, very quickly, people made modifications to let you run it on your own desktop. And that's fine.Swyx [00:56:37]: Someone else can do that.Erik [00:56:37]: But we don't want that to be the official, anthropic thing to run. I would say also, from a product perspective, right now, because this is sort of still in beta, I think a lot of the most useful use cases are... Like, a sandbox is actually what you want. You want something where, hey, it can't mess up anything in here. It only has what I gave it. Also, if it's using your computer, you know, you can't use your computer at the same time. I think you actually want it to have its own screen. It's like you and a person pair programming, but only on one laptop versus you have two laptops.Swyx [00:57:07]: Everyone should totally have a side laptop where the computer uses... Cloud is just doing its thing. Yeah, yeah.Erik [00:57:11]: I think it's such a better experience. Unless there's something very explicit you want it to do for you on your own computer.Swyx [00:57:17]: It becomes like you're sort of shelling into a remote machine and, you know, maybe checking in on it every now and then. Like, I have fond memories of... Half our audience is going to be too young to remember this, but Citrix desktop experience, like, you were sort of remote into a machine that someone else was operating. And for a long time, that would be how you did, like, enterprise computing. Yeah, yeah. It's coming back. Any other implications of computer use? You know, is it a fun demo or is it, like, the future of Anthropic? I'm very excited about it.Erik [00:57:50]: I think that, like, there's a lot of sort of very repetitive work that, like, computer use will be great for. I think I've seen some examples of people build, like, coding agents that then also, like, test the front end that they made. So I think it's very cool to, like, use computer use to be able to close the loop on a lot of things that right now just a terminal-based agent can't do. So I think that's very exciting.Swyx [00:58:11]: It's kind of like end-to-end testing. Exactly. Yeah, yeah.Erik [00:58:14]: The end sort of front-end and web testing is something I'm very excited about.Swyx [00:58:18]: Yeah, I've seen Amanda also talking... This would be Amanda Askell, the head of Cloud Character. She goes on a lunch break and it generates, you know, research ideas for her. Giving it a name like computer use is very practical. It's like you're supposed to do things, but maybe sometimes it's not about doing things, it's about thinking. And thinking... In the process of thinking, you're using the computer. In some way that's, you know, solving SweetBench, like, you should be allowed to use the internet or you should be allowed to use a computer to solve it and use your vision and use whatever. Like, we're just sort of shackling it with all these restrictions just because we want to play nice for a benchmark. But really, you know, a full AI will be able to do all these things. To think. Yeah, we'll definitely be able to. To reason. To Google and search for things.Erik [00:58:58]: Yeah, yeah. Pull down inspiration.Alessio [00:59:00]: Can we just do a... before we wrap, a robotics corner?Swyx [00:59:03]: Oh, yeah, yeah.Alessio [00:59:04]: People are always curious, especially with somebody that is not trying to hype their own company. What's the state of AI robotics? Under-hyped, over-hyped?Erik [00:59:12]: Yeah, and I'll say, like, these are my opinions, not Anthropic's. And again, coming from a place of a burned-out robotics founder, so take everything with a grain of salt. I would say on the positives, like, there is really sort of incredible progress that's happened in the last five years that I think will be a big unlock for robotics. The first is just general purpose language models. I mean, there was an old saying in robotics that if to fully describe your task is harder than to just do the task, you can never automate it. Because, like, it's going to take more effort to even tell the robot how to do this thing than to me just do it itself. LLM solved that. I no longer need to go exhaustively program in every little thing I could do. The thing just has common sense. And it's going to know, how do I make a Reuben sandwich? I'm not going to have to go program that in. Whereas before, like, the idea of even, like, a cooking thing, it's like, oh god, like, we're gonna have the team of engineers that are hard coding recipes for the long tail of anything. It would be a disaster. So I think that's one thing, is that bringing common sense really is, like, solves this huge problem of describing tasks. The second big innovation has been diffusion models for path planning. A lot of this work came out of Toyota Research. There's a lot of startups now that are working on this, like Physical Intelligence Pi, Chelsea Finn's startup out of Stanford. And the basic idea here is using a little bit of the, I'd say maybe more inspiration from diffusion rather than diffusion models themselves. But they're a way to basically learn an end-to-end sort of motion control. Whereas previously, all of robotics motion control was sort of very hard-coded. You either, you know, you're programming in explicit motions, or you're programming in an explicit goal and using an optimization library to find the shortest path to it. This is now something where you just give it a bunch of demonstrations. And again, just like using learning, it's basically like learning from these examples. What does it mean to go pick up a cup? And doing these in a way just like diffusion models, where they are somewhat conditioned by text, you can have the same model learn many different tasks. And then the hope is that these start to generalize. That if you've trained it on picking up coffee cups and picking up books, then when I say pick up the backpack, it knows how to do that too. Even though you've never trained it on that. That's kind of the holy grail here, is that you train it on 500 different tasks, and then that's enough to really get it to generalize to do anything you would need. I think that's like still a big TBD. And these people are working, have like measured some degree of generalization. But at the end of the day, it's also like LLMs. Like, you know, do you really care about the thing, being able to do something that no one has ever shown in training data? People for like a home robot, there's going to be like a hundred things that people really wanted to do. And you can just make sure it has good training for those things. What you do care about then is like generalization within a task of, oh, I've never seen this particular coffee mug before. Can I still pick it up? And those, the models do seem very good at. So these kind of are the two big things that are going for robotics right now, is LLMs for common sense and diffusion-inspired path planning algorithms. I think this is very promising, but I think there's a lot of hype. And I think where we are right now is where self-driving cars were 10 years ago. I think we have very cool demos that work. I mean, 10 years ago, you had videos of people driving a car on the highway, driving a car, you know, on a street with a safety driver. But it's really taken a long time to go from there to, I took a Waymo here today. And even Waymo is only in SF and a few other cities. And I think it takes a long time for these things to actually get everywhere and to get all the edge cases covered. I think that for robotics, the limiting factor is going to be reliability, that these models are really good at doing these demos of doing laundry or doing dishes. If they only work 99% of the time, that sounds good, but that's actually really annoying. Humans are really good at these tasks. Imagine if one out of every 100 dishes, it washed, it breaks. You would not want that robot in your house, or you certainly wouldn't want that in your factory if one of every 100 boxes that it moves, it drops and breaks things inside it. So I think for these things to really be useful, they're going to have to hit a very, very high level of reliability, just like self-driving cars. And I don't know how hard it's going to be for these models to move from the 95% reliability to 99.9. I think that's going to be the big thing. And I think also, I'm a little skeptical of how good the unit economics of these things will be. These robots are going to be very expensive to build. And if you're just trying to replace labor, like a one-for-one purchase, it kind of sets an upper cap about how much you can charge. And so it seems like it's not that great a business. I'm also worried about that for the self-driving car industry.Alessio [01:04:05]: Do you see most of the applications actually taking some of the older, especially manufacturing machinery, which needs to be very precise? Even if it's off by just a few millimeters, it cannot screw up the whole thing and be able to adjust at the edge? Or do you think the net new use cases may be more interesting?Erik [01:04:24]: I think it'd be very hard to replace a lot of those traditional manufacturing robots because everything relies on that precision. If you have a model that can, again, only get there 99% of the time, you don't want 1% of your cars to have the weld in the wrong spot. That's going to be a disaster. And a lot of manufacturing is all about getting rid of as much variance and uncertainty asSwyx [01:04:47]: possible.Erik [01:04:47]: Yeah.Swyx [01:04:48]: And what about the hardware?Alessio [01:04:49]: A lot of my friends that work in robotics, one of their big issues is sometimes you just have a servo that fails, and it takes a bunch of time to fix that.Swyx [01:04:57]: Is that holding back things?Alessio [01:04:58]: Or is the software still, anyway, not that ready?Swyx [01:05:01]: I think both.Erik [01:05:01]: I think there's been a lot more progress in the software in the last few years. And I think a lot of the humanoid robot companies now are really trying to build amazing hardware. Hardware is just so hard. It's something where you build your first robot, and it works. You're like, great. Then you build 10 of them. Five of them work. Three of them work half the time. Two of them don't work. And you built them all the same, and you don't know why. And it's just like the real world has this level of detail and differences that softwareSwyx [01:05:28]: doesn't have.Erik [01:05:29]: Imagine if every for loop you wrote, some of them just didn't work. Some of them were slower than others. How do you deal with that? Imagine if every binary that you shipped to a customer, each of those four loops was aSwyx [01:05:41]: little different.Erik [01:05:41]: It becomes just so hard to scale and maintain quality of these things. And I think that's what makes hardware really hard. It's not building one of something, but repeatedly building something and making it work reliably. Where again, you'll buy a batch of 100 motors, and each of those motors will behave a little bit differently to the same input command.Swyx [01:06:01]: This is your lived experience at Cobalt.Erik [01:06:03]: And robotics is all about how do you build something that's robust despite these differences.Swyx [01:06:08]: We can't get the tolerance of motors down to-Erik [01:06:10]: It's just everything.Swyx [01:06:13]: It's actually everything.Alessio [01:06:14]: Yeah.Erik [01:06:15]: No, I mean, one of my horror stories was that at Cobalt, this was many years ago, we had a thermal camera on the robot that had a USB connection to the computer inside, which is, first of all, is a big mistake. You're not supposed to use a USB. It is not a reliable protocol. It's designed that if there's mistakes, the user can just unplug it and plug it back in. I see. And so typically things that are USB, they're not designed to the same level of very high reliability you need. Again, because they assume someone will just unplug it and replug it back in. You just say someone sometime.Swyx [01:06:46]: I heard this too, and I didn't listen to it.Erik [01:06:47]: I really wish I had before. Anyway, at a certain point, a bunch of these thermal cameras started failing, and we couldn't figure out why. And I asked everyone on the team, like, hey, what's changed? Did the software change around this? Did the hardware design change around this? And I was investigating all this stuff, looking at kernel logs of what's happening with thisSwyx [01:07:07]: thing.Erik [01:07:07]: And finally, the procurement person was like, oh, yeah, well, I found this new vendor for USB cables last summer.Swyx [01:07:14]: And I'm like, what?Erik [01:07:15]: You switched which vendor were buying USB cables? I'm like, yeah, it's the same exact cable. It's just a dollar cheaper. And it turns out this was the problem. This new cable had slightly worse resistance or slightly worse EMI interference. And it worked most of the time. But 1% of the time, these cameras would fail, and we'd need to reboot a big part of the system. And it was all just because the same exact spec, these two different USB cables, slightly different. And so these are the kind of things you deal with with hardware.Swyx [01:07:45]: For listeners, we had an episode with Josh Albrecht in BU where he talked about buying tens of thousands of GPUs. And just some of them will just not do math. Yeah, that's the same thing. You run some tests to find the bad batch, and then you return it to sender because they just, GPUs won't do math, right? Yeah, yeah, this is the thing.Erik [01:08:05]: The real world has this level of detail. Eric Jang, he did AI at Google.Swyx [01:08:11]: Yeah, 1X. Yeah, and then joined 1X.Erik [01:08:13]: I see him post on Twitter occasionally of complaints about hardware and supply chain. And we know each other, and we joke occasionally. I went from robotics into AI, and he went from AI into robotics.Swyx [01:08:26]: I mean, look, very, very promising. The time of the real world is unlimited, right? But just also a lot harder. And yeah, I do think something I also tell people about for why working software agents is they're infinitely clonable. Yeah, they always work the same way. Mostly, unless you're using Python. And yeah, I mean, this is the whole thesis. I'm also interested, you dropped a little bit of alpha there. I don't want to make sure we don't lose it. Like, you're kind of skeptical about self-driving as a business. So I want to double click on this a little bit, because I mean, I think that shouldn't be taken away. We do have some public Waymo numbers. Read from Waymo is pretty public with their stats. They're exceeding 100 Waymo trips a week. If you assume a 25𝑟𝑖𝑑𝑒𝑎𝑣𝑒𝑟𝑎𝑔𝑒,𝑡ℎ𝑎𝑡′𝑠25rideaverage,that′s130 million revenue run rate. At some point, they will recoup their investment, right? Like, what are we talking about here? Way to skepticism.Erik [01:09:21]: I think, and again, I'm not an expert. I don't know their financials. I would say the thing I'm worried about is compared to an Uber, I don't know how much an Uber driver takes home a year, but call that the revenue that a Waymo is going to be making in that same year. Those cars are expensive. It's not about if you can hit profitability, it's about your cash conversion cycles. Is building one Waymo, how cheap can you make that compared to how much you're earning as the equivalent of what an Uber driver would take home? Because remember, an Uber driver, you're not getting that whole revenue. You think about, for the Uber driver, the cost of the car, the depreciation of the car. I'm not convinced how much profit Waymo can actually make per car.Swyx [01:10:02]: That's, I think, my skepticism.Alessio [01:10:02]: Well, they need to pre-assess the run Waymo because the Class C is like $110 grand, somethingSwyx [01:10:09]: like that, plus the LiDAR. That's many years of, yeah, yeah, yeah. Exactly, exactly. Anything else?Alessio [01:10:14]: Parting thoughts? Call to action? Rants?Swyx [01:10:18]: The floor is yours.Erik [01:10:19]: I'm very excited to see a lot more LLM agents out there in the world doing things. And I think they'll be, the biggest limiting thing will start to become, do people trust the output of these agents? And how do you trust the output of an agent that did five hours of work for you and is coming back with something? And if you can't find some way to trust that agent's work, it kind of wasn't valuable at all. So I think that's going to be a really important thing, is not just doing the work, but doing the work in a trustable, auditable way where you can also explain to the human, hey, here's exactly how this works and why and how I came to it. I think that's going to be really important.Swyx [01:10:54]: Thank you so much. Yeah, thanks. This was great. Get full access to Latent Space at www.latent.space/subscribe
    --------  
    1:11:10
  • Why Compound AI + Open Source will beat Closed AI
    We have a full slate of upcoming events: AI Engineer London, AWS Re:Invent in Las Vegas, and now Latent Space LIVE! at NeurIPS in Vancouver and online. Sign up to join and speak!We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here for a chance to appear on the show!We try to stay close to the inference providers as part of our coverage, as our podcasts with Together AI and Replicate will attest: However one of the most notable pull quotes from our very well received Braintrust episode was his opinion that open source model adoption has NOT gone very well and is actually declining in relative market share terms (it is of course increasing in absolute terms):Today’s guest, Lin Qiao, would wholly disagree. Her team of Pytorch/GPU experts are wholly dedicated toward helping you serve and finetune the full stack of open source models from Meta and others, across all modalities (Text, Audio, Image, Embedding, Vision-understanding), helping customers like Cursor and Hubspot scale up open source model inference both rapidly and affordably.Fireworks has emerged after its successive funding rounds with top tier VCs as one of the leaders of the Compound AI movement, a term first coined by the Databricks/Mosaic gang at Berkeley AI and adapted as “Composite AI” by Gartner:Replicating o1We are the first podcast to discuss Fireworks’ f1, their proprietary replication of OpenAI’s o1. This has become a surprisingly hot area of competition in the past week as both Nous Forge and Deepseek r1 have launched competitive models.Full Video PodcastLike and subscribe!Timestamps* 00:00:00 Introductions* 00:02:08 Pre-history of Fireworks and PyTorch at Meta* 00:09:49 Product Strategy: From Framework to Model Library* 00:13:01 Compound AI Concept and Industry Dynamics* 00:20:07 Fireworks' Distributed Inference Engine* 00:22:58 OSS Model Support and Competitive Strategy* 00:29:46 Declarative System Approach in AI* 00:31:00 Can OSS replicate o1?* 00:36:51 Fireworks f1* 00:41:03 Collaboration with Cursor and Speculative Decoding* 00:46:44 Fireworks quantization (and drama around it)* 00:49:38 Pricing Strategy* 00:51:51 Underrated Features of Fireworks Platform* 00:55:17 HiringTranscriptAlessio [00:00:00]: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, partner at CTO at Danceable Partners, and I'm joined by my co-host, Swyx founder, Osmalayar.Swyx [00:00:11]: Hey, and today we're in a very special studio inside the Fireworks office with Lin Qiang, CEO of Fireworks. Welcome. Yeah.Lin [00:00:20]: Oh, you should welcome us.Swyx [00:00:21]: Yeah, welcome. Yeah, thanks for having us. It's unusual to be in the home of a startup, but it's also, I think our relationship is a bit unusual compared to all our normal guests. Definitely.Lin [00:00:34]: Yeah. I'm super excited to talk about very interesting topics in that space with both of you.Swyx [00:00:41]: You just celebrated your two-year anniversary yesterday.Lin [00:00:43]: Yeah, it's quite a crazy journey. We circle around and share all the crazy stories across these two years, and it has been super fun. All the way from we experienced Silicon Valley bank run to we delete some data that shouldn't be deleted operationally. We went through a massive scale where we actually are busy getting capacity to, yeah, we learned to kind of work with it as a team with a lot of brilliant people across different places to join a company. It has really been a fun journey.Alessio [00:01:24]: When you started, did you think the technical stuff will be harder or the bank run and then the people side? I think there's a lot of amazing researchers that want to do companies and it's like the hardest thing is going to be building the product and then you have all these different other things. So, were you surprised by what has been your experience the most?Lin [00:01:42]: Yeah, to be honest with you, my focus has always been on the product side and then after the product goes to market. And I didn't realize the rest has been so complicated, operating a company and so on. But because I don't think about it, I just kind of manage it. So it's done. I think I just somehow don't think about it too much and solve whatever problem coming our way and it worked.Swyx [00:02:08]: So let's, I guess, let's start at the pre-history, the initial history of Fireworks. You ran the PyTorch team at Meta for a number of years and we previously had Sumit Chintal on and I think we were just all very interested in the history of GenEI. Maybe not that many people know how deeply involved Faire and Meta were prior to the current GenEI revolution.Lin [00:02:35]: My background is deep in distributed system, database management system. And I joined Meta from the data side and I saw this tremendous amount of data growth, which cost a lot of money and we're analyzing what's going on. And it's clear that AI is driving all this data generation. So it's a very interesting time because when I joined Meta, Meta is going through ramping down mobile-first, finishing the mobile-first transition and then starting AI-first. And there's a fundamental reason about that sequence because mobile-first gave a full range of user engagement that has never existed before. And all this user engagement generated a lot of data and this data power AI. So then the whole entire industry is also going through, falling through this same transition. When I see, oh, okay, this AI is powering all this data generation and look at where's our AI stack. There's no software, there's no hardware, there's no people, there's no team. I want to dive up there and help this movement. So when I started, it's very interesting industry landscape. There are a lot of AI frameworks. It's a kind of proliferation of AI frameworks happening in the industry. But all the AI frameworks focus on production and they use a very certain way of defining the graph of neural network and then use that to drive the model iteration and productionization. And PyTorch is completely different. So they could also assume that he was the user of his product. And he basically says, researchers face so much pain using existing AI frameworks, this is really hard to use and I'm going to do something different for myself. And that's the origin story of PyTorch. PyTorch actually started as the framework for researchers. They don't care about production at all. And as they grow in terms of adoption, so the interesting part of AI is research is the top of our normal production. There are so many researchers across academic, across industry, they innovate and they put their results out there in open source and that power the downstream productionization. So it's brilliant for MATA to establish PyTorch as a strategy to drive massive adoption in open source because MATA internally is a PyTorch shop. So it creates a flying wheel effect. So that's kind of a strategy behind PyTorch. But when I took on PyTorch, it's kind of at Caspo, MATA established PyTorch as the framework for both research and production. So no one has done that before. And we have to kind of rethink how to architect PyTorch so we can really sustain production workload, the stability, reliability, low latency, all this production concern was never a concern before. Now it's a concern. And we actually have to adjust its design and make it work for both sides. And that took us five years because MATA has so many AI use cases, all the way from ranking recommendation as powering the business top line or as ranking newsfeed, video ranking to site integrity detect bad content automatically using AI to all kinds of effects, translation, image classification, object detection, all this. And also across AI running on the server side, on mobile phones, on AI VR devices, the wide spectrum. So by the time we actually basically managed to support AI across ubiquitous everywhere across MATA. But interestingly, through open source engagement, we work with a lot of companies. It is clear to us like this industry is starting to take on AI first transition. And of course, MATA's hyperscale always go ahead of industry. And it feels like when we start this AI journey at MATA, there's no software, no hardware, no team. For many companies we engage with through PyTorch, we feel the pain. That's the genesis why we feel like, hey, if we create fireworks and support industry going through this transition, it will be a huge amount of impact. Of course, the problem that the industry is facing will not be the same as MATA. MATA is so big, right? So it's kind of skewed towards extreme scale and extreme optimization in the industry will be different. But we feel like we have the technical chop and we've seen a lot. We'll look to kind of drive that. So yeah, so that's how we started.Swyx [00:06:58]: When you and I chatted about the origins of fireworks, it was originally envisioned more as a PyTorch platform, and then later became much more focused on generative AI. Is that fair to say? What was the customer discovery here?Lin [00:07:13]: Right. So I would say our initial blueprint is we should build a PyTorch cloud because a PyTorch library and there's no SaaS platform to enable AI workloads.Swyx [00:07:26]: Even in 2022, it's interesting.Lin [00:07:28]: I would not say absolutely no, but cloud providers have some of those, but it's not first class citizen, right? At 2022, there's still like TensorFlow is massively in production. And this is all pre-gen AI, and PyTorch is kind of getting more and more adoption. But there's no PyTorch-first SaaS platform existing. At the same time, we are also a very pragmatic set of people. We really want to make sure from the get-go, we get really, really close to customers. We understand their use case, we understand their pain points, we understand the value we deliver to them. So we want to take a different approach instead of building a horizontal PyTorch cloud. We want to build a verticalized platform first. And then we talk with many customers. And interestingly, we started the company in September 2022, and in October, November, the OpenAI announced ChatGPT. And then boom, when we talked with many customers, they were like, can you help us work on the JNS aspect? So of course, there are some open source models. It's not as good at that time, but people are already putting a lot of attention there. Then we decided that if we're going to pick a vertical, we're going to pick JNI. The other reason is all JNI models are PyTorch models. So that's another reason. We believe that because of the nature of JNI, it's going to generate a lot of human consumable content. It will drive a lot of consumer, customer-developer-facing application and product innovation. Guaranteed. We're just at the beginning of this. Our prediction is for those kind of applications, the inference is much more important than training because inference scale is proportional to the up-limit award population. And training scale is proportional to the number of researchers. Of course, each training round could be very expensive. Although PyTorch supports both inference and training, we decided to laser focus on inference. So yeah, so that's how we got started. And we launched our public platform August last year. When we launched, it was a single product. It's a distributed inference engine with a simple API, open AI compatible API with many models. We started with LM and then we added a lot of models. Fast forward to now, we are a full platform with multiple product lines. So we love to kind of dive deep into what we offer. But that's a very fun journey in the past two years.Alessio [00:09:49]: What was the transition from you start to focus on PyTorch and people want to understand the framework, get it live. And now say maybe most people that use you don't even really know much about PyTorch at all. You know, they're just trying to consume a model. From a product perspective, like what were some of the decisions early on? Like right in October, November, you were just like, hey, most people just care about the model, not about the framework. We're going to make it super easy or was it more a gradual transition to the model librarySwyx [00:10:16]: you have today?Lin [00:10:17]: Yeah. So our product decision is all based on who is our ICP. And one thing I want to acknowledge here is the generic technology is disruptive. It's very different from AI before GNI. So it's a clear leap forward. Because before GNI, the companies that want to invest in AI, they have to train from scratch. There's no other way. There's no foundation model. It doesn't exist. So that means then to start a team, first hire a team who is capable of crunch data. There's a lot of data to crunch, right? Because training from scratch, you have to prepare a lot of data. And then they need to have GPUs to train, and then you start to manage GPUs. So then it becomes a very complex project. It takes a long time and not many companies can afford it, actually. And the GNI is a very different game right now, because it is a foundation model. So you don't have to train anymore. That makes AI much more accessible as a technology. As an app developer or product manager, even, not a developer, they can interact with GNI models directly. So our goal is to make AI accessible to all app developers and product engineers. That's our goal. So then getting them into the building model doesn't make any sense anymore with this new technology. And then building easy, accessible APIs is the most important. Early on, when we got started, we decided we're going to be open AI compatible. It's just kind of very easy for developers to adopt this new technology, and we will manage the underlying complexity of serving all these models.Swyx [00:11:56]: Yeah, open AI has become the standard. Even as we're recording today, Gemini announced that they have open AI compatible APIs. Interesting. So we just need to drop it all in line, and then we have everyone popping in line.Lin [00:12:09]: That's interesting, because we are working very closely with Meta as one of the partners. Meta, of course, is kind of very generous to donate many very, very strong open source models, expecting more to come. But also they have announced LamaStack, which is basically standardized, the upper level stack built on top of Lama models. So they don't just want to give out models and you figure out what the upper stack is. They instead want to build a community around the stack and build a new standard. I think there's an interesting dynamics in play in the industry right now, when it's more standardized across open AI, because they are kind of creating the top of the funnel, or standardized across Lama, because this is the most used open source model. So I think it's a lot of fun working at this time.Swyx [00:13:01]: I've been a little bit more doubtful on LamaStack, I think you've been more positive. Basically it's just like the meta version of whatever Hugging Face offers, you know, or TensorRT, or BLM, or whatever the open source opportunity is. But to me, it's not clear that just because Meta open sources Lama, that the rest of LamaStack will be adopted. And it's not clear why I should adopt it. So I don't know if you agree.Lin [00:13:27]: It's very early right now. That's why I kind of work very closely with them and give them feedback. The feedback to the meta team is very important. So then they can use that to continue to improve the model and also improve the higher level I think the success of LamaStack heavily depends on the community adoption. And there's no way around it. And I know the meta team would like to kind of work with a broader set of community. But it's very early.Swyx [00:13:52]: One thing that after your Series B, so you raced for Benchmark, and then Sequoia. I remember being close to you for at least your Series B announcements, you started betting heavily on this term of Compound AI. It's not a term that we've covered very much in the podcast, but I think it's definitely getting a lot of adoption from Databricks and Berkeley people and all that. What's your take on Compound AI? Why is it resonating with people?Lin [00:14:16]: Right. So let me give a little bit of context why we even consider that space.Swyx [00:14:22]: Because like pre-Series B, there was no message, and now it's like on your landing page.Lin [00:14:27]: So it's kind of very organic evolution from when we first launched our public platform, we are a single product. We are a distributed inference engine, where we do a lot of innovation, customized KUDA kernels, raw kernel kernels, running on different kinds of hardware, and build distributed disaggregated execution, inference execution, build all kinds of caching. So that is one. So that's kind of one product line, is the fast, most cost-efficient inference platform. Because we wrote PyTorch code, we know we basically have a special PyTorch build for that, together with a custom kernel we wrote. And then we worked with many more customers, we realized, oh, the distributed inference engine, our design is one size fits all. We want to have this inference endpoint, then everyone come in, and no matter what kind of form and shape or workload they have, it will just work for them. So that's great. But the reality is, we realized all customers have different kinds of use cases. The use cases come in all different forms and shapes. And the end result is the data distribution in their inference workload doesn't align with the data distribution in the training data for the model. It's a given, actually. If you think about it, because researchers have to guesstimate what is important, what's not important in preparing data for training. So because of that misalignment, then we leave a lot of quality, latency, cost improvement on the table. So then we're saying, OK, we want to heavily invest in a customization engine. And we actually announced it called FHIR Optimizer. So FHIR Optimizer basically helps users navigate a three-dimensional optimization space across quality, latency, and cost. So it's a three-dimensional curve. And even for one company, for different use cases, they want to land in different spots. So we automate that process for our customers. It's very simple. You have your inference workload. You inject into the optimizer along with the objective function. And then we spit out inference deployment config and the model setup. So it's your customized setup. So that is a completely different product. So that product thinking is one size fits all. And now on top of that, we provide a huge variety of state-of-the-art models, hundreds of them, varying from text to large state-of-the-art English models. That's where we started. And as we talk with many customers, we realize, oh, audio and text are very, very close. Many of our customers start to build assistants, all kinds of assistants using text. And they immediately want to add audio, audio in, audio out. So we support transcription, translation, speech synthesis, text, audio alignment, all different kinds of audio features. It's a big announcement. You should have heard by the time this is out. And the other areas of vision and text are very close with each other. Because a lot of information doesn't live in plain text. A lot of information lives in multimedia format, images, PDFs, screenshots, and many other different formats. So oftentimes to solve a problem, we need to put the vision model first to extract information and then use language model to process and then send out results. So vision is important. We also support vision model, various different kinds of vision models specialized in processing different kinds of source and extraction. And we're also going to have another announcement of a new API endpoint we'll support for people to upload various different kinds of multimedia content and then get the extract very accurate information out and feed that into LM. And of course, we support embedding because embedding is very important for semantic search, for RAG, and all this. And in addition to that, we also support text-to-image, image generation models, text-to-image, image-to-image, and we're adding text-to-video as well in our portfolio. So it's a very comprehensive set of model catalog that built on top of File Optimizer and Distributed Inference Engine. But then we talk with more customers, they solve business use case, and then we realize one model is not sufficient to solve their problem. And it's very clear because one is the model hallucinates. Many customers, when they onboard this JNI journey, they thought this is magical. JNI is going to solve all my problems magically. But then they realize, oh, this model hallucinates. It hallucinates because it's not deterministic, it's probabilistic. So it's designed to always give you an answer, but based on probabilities, so it hallucinates. And that's actually sometimes a feature for creative writing, for example. Sometimes it's a bug because, hey, you don't want to give misinformation. And different models also have different specialties. To solve a problem, you want to ask different special models to kind of decompose your task into multiple small tasks, narrow tasks, and then have an expert model solve that task really well. And of course, the model doesn't have all the information. It has limited knowledge because the training data is finite, not infinite. So the model oftentimes doesn't have real-time information. It doesn't know any proprietary information within the enterprise. It's clear that in order to really build a compiling application on top of JNI, we need a compound AI system. Compound AI system basically is going to have multiple models across modalities, along with APIs, whether it's public APIs, internal proprietary APIs, storage systems, database systems, knowledge to work together to deliver the best answer.Swyx [00:20:07]: Are you going to offer a vector database?Lin [00:20:09]: We actually heavily partner with several big vector database providers. Which is your favorite? They are all great in different ways. But it's public information, like MongoDB is our investor. And we have been working closely with them for a while.Alessio [00:20:26]: When you say distributed inference engine, what do you mean exactly? Because when I hear your explanation, it's almost like you're centralizing a lot of the decisions through the Fireworks platform on the quality and whatnot. What do you mean distributed? It's like you have GPUs in a lot of different clusters, so you're sharding the inference across the same model.Lin [00:20:45]: So first of all, we run across multiple GPUs. But the way we distribute across multiple GPUs is unique. We don't distribute the whole model monolithically across multiple GPUs. We chop them into pieces and scale them completely differently based on what's the bottleneck. We also are distributed across regions. We have been running in North America, EMEA, and Asia. We have regional affinity to applications because latency is extremely important. We are also doing global load balancing because a lot of applications there, they quickly scale to global population. And then at that scale, different content wakes up at a different time. And you want to kind of load balancing across. So all the way, and we also have, we manage various different kinds of hardware skew from different hardware vendors. And different hardware design is best for different types of workload, whether it's long context, short context, long generation. So all these different types of workload is best fitted for different kinds of hardware skew. And then we can even distribute across different hardware for a workload. So the distribution actually is all around in the full stack.Swyx [00:22:02]: At some point, we'll show on the YouTube, the image that Ray, I think, has been working on with all the different modalities that you offer. To me, it's basically you offer the open source version of everything that OpenAI typically offers. I don't think there is. Actually, if you do text to video, you will be a superset of what OpenAI offers because they don't have Sora. Is that Mochi, by the way? Mochi. Mochi, right?Lin [00:22:27]: Mochi. And there are a few others. I will say, the interesting thing is, I think we're betting on the open source community is going to proliferate. This is literally what we're seeing. And there's amazing video generation companies. There is amazing audio companies. Like cross-border, the innovation is off the chart, and we are building on top of that. I think that's the advantage we have compared with a closed source company.Swyx [00:22:58]: I think I want to restate the value proposition of Fireworks for people who are comparing you versus a raw GPU provider like a RunPod or Lambda or anything like those, which is like you create the developer experience layer and you also make it easily scalable or serverless or as an endpoint. And then, I think for some models, you have custom kernels, but not all models.Lin [00:23:25]: Almost for all models. For all large language models, all your models, and the VRMs. Almost for all models we serve.Swyx [00:23:35]: And so that is called Fire Attention. I don't remember the speed numbers, but apparently much better than VLM, especially on a concurrency basis.Lin [00:23:44]: So Fire Attention is specific mostly for language models, but for other modalities, we'll also have a customized kernel.Swyx [00:23:51]: And I think the typical challenge for people is understanding that has value, and then there are other people who are also offering open-source models. Your mode is your ability to offer a good experience for all these customers. But if your existence is entirely reliant on people releasing nice open-source models, other people can also do the same thing.Lin [00:24:14]: So I would say we build on top of open-source model foundation. So that's the kind of foundation we build on top of. But we look at the value prop from the lens of application developers and product engineers. So they want to create new UX. So what's happening in the industry right now is people are thinking about a completely new way of designing products. And I'm talking to so many founders, it's just mind-blowing. They help me understand existing way of doing PowerPoint, existing way of coding, existing way of managing customer service. It's actually putting a box in our head. For example, PowerPoint. So PowerPoint generation is we always need to think about how to fit into my storytelling into this format of slide one after another. And I'm going to juggle through design together with what story to tell. But the most important thing is what's our storytelling lines, right? And why don't we create a space that is not limited to any format? And those kind of new product UX design combined with automated content generation through Gen AI is the new thing that many founders are doing. What are the challenges they're facing? Let's go from there. One is, again, because a lot of products built on top of Gen AI, they are consumer-personal developer facing, and they require interactive experience. It's just a kind of product experience we all get used to. And our desire is to actually get faster and faster interaction. Otherwise, nobody wants to spend time, right? And then that requires low latency. And the other thing is the nature of consumer-personal developer facing is your audience is very big. You want to scale up to product market fit quickly. But if you lose money at a small scale, you're going to bankrupt quickly. So it's actually a big contrast. I actually have product market fit, but when I scale, I scale out of my business. So that's kind of a very funny way to think about it. So then having low latency and low cost is essential for those new applications and products to survive and really become a generation company. So that's the design point for our distributed inference engine and the file optimizer. File optimizer, you can think about that as a feedback loop. The more you feed your inference workload to our inference engine, the more we help you improve quality, lower latency further, lower your cost. It basically becomes better. And we automate that because we don't want you as an app developer or product engineer to think about how to figure out all these low-level details. It's impossible because you're not trained to do that at all. You should kind of keep your focus on the product innovation. And then the compound AI, we actually feel a lot of pain as the app developers, engineers, there are so many models. Every week, there's at least a new model coming out.Swyx [00:27:09]: Tencent had a giant model this week. Yeah, yeah.Lin [00:27:13]: I saw that. I saw that.Swyx [00:27:15]: It's like $500 billion.Lin [00:27:18]: So they're like, should I keep chasing this or should I forget about it? And which model should I pick to solve what kind of sub-problem? How do I even decompose my problem into those smaller problems and fit the model into it? I have no idea. And then there are two ways to think about this design. I think I talked about that in the past. One is imperative, as in you figure out how to do it. You give developer tools to dictate how to do it. Or you build a declarative system where a developer tells what they want to do, not how. So these are completely two different designs. So the analogy I want to draw is, in the data world, the database management system is a declarative system because people use database, use SQL. SQL is a way you say, what do you want to extract out of a database? What kind of result do you want? But you don't figure out which node is going to, how many nodes you're going to run on top of, how you redefine your disk, which index you use, which project. You don't need to worry about any of those. And database management system will figure out, generate a new best plan, and execute on that. So database is declarative. And it makes it super easy. You just learn SQL, which is learn a semantic meaning of SQL, and you can use it. Imperative side is there are a lot of ETL pipelines. And people design this DAG system with triggers, with actions, and you dictate exactly what to do. And if it fails, then how to recover. So that's an imperative system. We have seen a range of systems in the ecosystem go different ways. I think there's value of both. There's value of both. I don't think one is going to subsume the other. But we are leaning more into the philosophy of the declarative system. Because from the lens of app developer and product engineer, that would be easiest for them to integrate.Swyx [00:29:07]: I understand that's also why PyTorch won as well, right? This is one of the reasons. Ease of use.Lin [00:29:14]: Focus on ease of use, and then let the system take on the hard challenges and complexities. So we follow, we extend that thinking into current system design. So another announcement is we will also announce our next declarative system is going to appear as a model that has extremely high quality. And this model is inspired by Owen's announcement for OpenAI. You should see that by the time we announce this or soon.Alessio [00:29:46]: Trained by you.Lin [00:29:47]: Yes.Alessio [00:29:48]: Is this the first model that you trained? It's not the first.Lin [00:29:52]: We actually have trained a model called FireFunction. It's a function calling model. It's our first step into compound AI system. Because function calling model can dispatch a request into multiple APIs. We have pre-baked set of APIs the model learned. You can also add additional APIs through the configuration to let model dispatch accordingly. So we have a very high quality function calling model that's already released. We have actually three versions. The latest version is very high quality. But now we take a further step that you don't even need to use function calling model. You use our new model we're going to release. It will solve a lot of problems approaching very high OpenAI quality. So I'm very excited about that.Swyx [00:30:41]: Do you have any benchmarks yet?Lin [00:30:43]: We have a benchmark. We're going to release it hopefully next week. We just put our model to LMSYS and people are guessing. Is this the next Gemini model or a MADIS model? People are guessing. That's very interesting. We're watching the Reddit discussion right now.Swyx [00:31:00]: I have to ask more questions about this. When OpenAI released o1, a lot of people asked about whether or not it's a single model or whether it's a chain of models. Noam and basically everyone on the Strawberry team was very insistent that what they did for reinforcement learning, chain of thought, cannot be replicated by a whole bunch of open source model calls. Do you think that that is wrong? Have you done the same amount of work on RL as they have or was it a different direction?Lin [00:31:29]: I think they take a very specific approach where the caliber of team is very high. So I do think they are the domain expert in doing the things they are doing. I don't think there's only one way to achieve the same goal. We're on the same direction in the sense that the quality scaling law is shifting from training to inference. For that, I fully agree with them. But we're taking a completely different approach to the problem. All of that is because, of course, we didn't train the model from scratch. All of that is because we built on the show of giants. The current model available we have access to is getting better and better. The future trend is the gap between the open source model and the co-source model. It's just going to shrink to the point there's not much difference. And then we're on the same level field. That's why I think our early investment in inference and all the work we do around balancing across quality, latency, and cost pay off because we have accumulated a lot of experience and that empowers us to release this new model that is approaching open-ended quality.Alessio [00:32:39]: I guess the question is, what do you think the gap to catch up will be? Because I think everybody agrees with open source models eventually will catch up. And I think with 4, then with Lama 3.2, 3.1, 4.5b, we close the gap. And then 0.1 just reopened the gap so much and it's unclear. Obviously, you're saying your model will have...Swyx [00:32:57]: We're closing that gap.Alessio [00:32:58]: But you think in the future, it's going to be months?Lin [00:33:02]: So here's the thing that's happened. There's public benchmark. It is what it is. But in reality, open source models in certain dimensions are already on par or beat closed source models. So for example, in the coding space, open source models are really, really good. And in function calling, file function is also really, really good. So it's all a matter of whether you build one model to solve all the problems and you want to be the best of solving all the problems, or in the open source domain, it's going to specialize. All these different model builders specialize in certain narrow area. And it's logical that they can be really, really good in that very narrow area. And that's our prediction is with specialization, there will be a lot of expert models really, really good and even better than one-size-fits-all closed source models.Swyx [00:33:55]: I think this is the core debate that I am still not 100% either way on in terms of compound AI versus normal AI. Because you're basically fighting the bitter lesson.Lin [00:34:09]: Look at the human society, right? We specialize. And you feel really good about someone specializing doing something really well, right? And that's how our way evolved from ancient times. We're all journalists. We do everything. Now we heavily specialize in different domains. So my prediction is in the AI model space, it will happen also. Except for the bitter lesson.Swyx [00:34:30]: You get short-term gains by having specialists, domain specialists, and then someone just needs to train like a 10x bigger model on 10x more inference, 10x more data, 10x more model perhaps, whatever the current scaling law is. And then it supersedes all the individual models because of some generalized intelligence slash world knowledge. I think that is the core insight of the GPTs, the GPT-123 networks. Right.Lin [00:34:56]: But the training scaling law is because you have an increasing amount of data to train from. And you can do a lot of compute. So I think on the data side, we're approaching the limit. And the only data to increase that is synthetic generated data. And then there's like what is the secret sauce there, right? Because if you have a very good large model, you can generate very good synthetic data and then continue to improve quality. So that's why I think in OpenAI, they are shifting from the training scaling law intoSwyx [00:35:25]: inference scaling law.Lin [00:35:25]: And it's the test time and all this. So I definitely believe that's the future direction. And that's where we are really good at, doing inference.Swyx [00:35:34]: A couple of questions on that. Are you planning to share your reasoning choices?Lin [00:35:39]: That's a very good question. We are still debating.Swyx [00:35:43]: Yeah.Lin [00:35:45]: We're still debating.Swyx [00:35:46]: I would say, for example, it's interesting that, for example, SweetBench. If you want to be considered for ranking, you have to submit your reasoning choices. And that has actually disqualified some of our past guests. Cosign was doing well on SweetBench, but they didn't want to leak those results. So that's why you don't see O1 preview on SweetBench, because they don't submit their reasoning choices. And obviously, it's IP. But also, if you're going to be more open, then that's one way to be more open. So your model is not going to be open source, right? It's going to be an endpoint that you provide. Okay, cool. And then pricing, also the same as OpenAI, just kind of based on...Lin [00:36:25]: Yeah, this is... I don't have, actually, information. Everything is going so fast, we haven't even thought about that yet. Yeah, I should be more prepared.Swyx [00:36:33]: I mean, this is live. You know, it's nice to just talk about it as it goes live. Any other things that you want feedback on or you're thinking through? It's kind of nice to just talk about something when it's not decided yet. About this new model. It's going to be exciting. It's going to generate a lot of buzz. Right.Lin [00:36:51]: I'm very excited to see how people are going to use this model. So there's already a Reddit discussion about it. And people are asking very deep, mathematical questions. And since the model got it right, surprising. And internally, we're also asking the model to generate what is AGI. And it generates a very complicated DAG thinking process. So we're having a lot of fun testing this internally. But I'm more curious, how will people use it? What kind of application they're going to try and test on it? And that's where we really like to hear feedback from the community. And also feedback to us. What works out well? What doesn't work out well? What works out well, but surprising them? And what kind of thing they think we should improve on? And those kind of feedback will be tremendously helpful.Swyx [00:37:44]: Yeah. So I've been a production user of Preview and Mini since launch. I would say they're very, very obvious jobs in quality. So much so that they made clods on it. And they made the previous state-of-the-art look bad. It's really that stark, that difference. The number one thing, just feedback or feature requests, is people want control on the budget. Because right now, in 0.1, it kind of decides its own thinking budget. But sometimes you know how hard the problem is. And you want to actually tell the model, spend two minutes on this. Or spend some dollar amount. Maybe it's time you miss dollars. I don't know what the budget is. That makes a lot of sense.Lin [00:38:27]: So we actually thought about that requirement. And it should be, at some point, we need to support that. Not initially. But that makes a lot of sense.Swyx [00:38:38]: Okay. So that was a fascinating overview of just the things that you're working on. First of all, I realized that... I don't know if I've ever given you this feedback. But I think you guys are one of the reasons I agreed to advise you. Because I think when you first met me, I was kind of dubious. I was like... Who are you? There's Replicate. There's Together. There's Laptop. There's a whole bunch of other players. You're in very, very competitive fields. Like, why will you win? And the reason I actually changed my mind was I saw you guys shipping. I think your surface area is very big. The team is not that big. No. We're only 40 people. Yeah. And now here you are trying to compete with OpenAI and everyone else. What is the secret?Lin [00:39:21]: I think the team. The team is the secret.Swyx [00:39:23]: Oh boy. So there's no thing I can just copy. You just... No.Lin [00:39:30]: I think we all come from a very aligned culture. Because most of our team came from meta.Swyx [00:39:38]: Yeah.Lin [00:39:38]: And many startups. So we really believe in results. One is result. And second is customer. We're very customer obsessed. And we don't want to drive adoption for the sake of adoption. We really want to make sure we understand we are delivering a lot of business values to the customer. And we really value their feedback. So we would wake up midnight and deploy some model for them. Shuffle some capacity for them. And yeah, over the weekend, no brainer.Swyx [00:40:15]: So yeah.Lin [00:40:15]: So that's just how we work as a team. And the caliber of the team is really, really high as well. So as plug-in, we're hiring. We're expanding very, very fast. So if we are passionate about working on the most cutting-edge technology in the general space, come talk with us. Yeah.Swyx [00:40:38]: Let's talk a little bit about that customer journey. I think one of your more famous customers is Cursor. We were the first podcast to have Cursor on. And then obviously since then, they have blown up. Cause and effect are not related. But you guys especially worked on a fast supply model where you were one of the first people to work on speculative decoding in a production setting. Maybe just talk about what was the behind the scenes of working with Cursor?Lin [00:41:03]: I will say Cursor is a very, very unique team. I think the unique part is the team has very high technical caliber. There's no question about it. But they have decided, although many companies building coding co-pilot, they will say, I'm going to build a whole entire stack because I can. And they are unique in the sense they seek partnership. Not because they cannot. They're fully capable, but they know where to focus. That to me is amazing. And of course, they want to find a bypass partner. So we spent some time working together. They are pushing us very aggressively because for them to deliver high caliber product experience, they need the latency. They need the interactive, but also high quality at the same time. So actually, we expanded our product feature quite a lot as we support Cursor. And they are growing so fast. And we massively scaled quickly across multiple regions. And we developed a pretty high intense inference stack, almost like similar to what we do for Meta. I think that's a very, very interesting engagement. And through that, there's a lot of trust being built. They realize, hey, this is a team they can really partner with. And they can go big with. That comes back to, hey, we're really customer obsessed. And all the engineers working with them, there's just enormous amount of time syncing together with them and discussing. And we're not big on meetings, but we are like stack channel always on. Yeah, so you almost feel like working as one team. So I think that's really highlighted.Swyx [00:42:38]: Yeah. For those who don't know, so basically Cursor is a VS Code fork. But most of the time, people will be using closed models. Like I actually use a lot of SONET. So you're not involved there, right? It's not like you host SONET or you have any partnership with it. You're involved where Cursor is small, or like their house brand models are concerned, right?Lin [00:42:58]: I don't know what I can say, but the things they haven't said.Swyx [00:43:04]: Very obviously, the drop down is 4.0, but in Cursor, right? So I assume that the Cursor side is the Fireworks side. And then the other side, they're calling out the other. Just kind of curious. And then, do you see any more opportunity on the... You know, I think you made a big splash with 1,000 tokens per second. That was because of speculative decoding. Is there more to push there?Lin [00:43:25]: We push a lot. Actually, when I mentioned Fire Optimizer, right? So as in, we have a unique automation stack that is one size fits one. We actually deployed to Cursor earlier on. Basically optimized for their specific workload. And that's a lot of juice to extract out of there. And we see success in that product. It actually can be widely adopted. So that's why we started a separate product line called Fire Optimizer. So speculative decoding is just one approach. And speculative decoding here is not static. We actually wrote a blog post about it. There's so many different ways to do speculative decoding. You can pair a small model with a large model in the same model family. Or you can have equal pads and so on. There are different trade-offs which approach you take. It really depends on your workload. And then with your workload, we can align the Eagle heads or Medusa heads or a small big model pair much better to extract the best latency reduction. So all of that is part of the Fire Optimizer offering.Alessio [00:44:23]: I know you mentioned some of the other inference providers. I think the other question that people always have is around benchmarks. So you get different performance on different platforms. How should people think about... People are like, hey, Lama 3.2 is X on MMLU. But maybe using speculative decoding, you go down a different path. Maybe some providers run a quantized model. How should people think about how much they should care about how you're actually running the model? What's the delta between all the magic that you do and what a raw model...Lin [00:44:57]: Okay, so there are two big development cycles. One is experimentation, where they need fast iteration. They don't want to think about quality, and they just want to experiment with product experience and so on. So that's one. And then it looks good, and they want to post-product market with scaling. And the quality is really important. And latency and all the other things are becoming important. During the experimentation phase, it's just pick a good model. Don't worry about anything else. Make sure you even generate the right solution to your product. And that's the focus. And then post-product market fit, then that's kind of the three-dimensional optimization curve start to kick in across quality, latency, cost, where you should land. And to me, it's purely a product decision. To many products, if you choose a lower quality, but better speed and lower cost, but it doesn't make a difference to the product experience, then you should do it. So that's why I think inference is part of the validation. The validation doesn't stop at offline eval. The validation will go through A-B testing, through inference. And that's where we offer various different configurations for you to test which is the best setting. So this is the traditional product evaluation. So product evaluation should also include your new model versions and different model setup into the consideration.Swyx [00:46:22]: I want to specifically talk about what happens a few months ago with some of your major competitors. I mean, all of this is public. What is your take on what happens? And maybe you want to set the record straight on how Fireworks does quantization because I think a lot of people may have outdated perceptions or they didn't read the clarification post on your approach to quantization.Lin [00:46:44]: First of all, it's always a surprise to us that without any notice, we got called out.Swyx [00:46:51]: Specifically by name, which is normally not what...Lin [00:46:54]: Yeah, in a public post. And have certain interpretation of our quality. So I was really surprised. And it's not a good way to compete, right? We want to compete fairly. And oftentimes when one vendor gives out results, the interpretation of another vendor is always extremely biased. So we actually refrain ourselves to do any of those. And we happily partner with third parties to do the most fair evaluation. So we're very surprised. And we don't think that's a good way to figure out the competition landscape. So then we react. I think when it comes to quantization, the interpretation, we wrote actually a very thorough blog post. Because again, no one says it's all. We have various different quantization schemes. We can quantize very different parts of the model from ways to activation to cross-TPU communication. They can use different quantization schemes or consistent across the board. And again, it's a trade-off. It's a trade-off across this three-dimensional quality, latency, and cost. And for our customer, we actually let them find the best optimized point. And we have a very thorough evaluation process to pick that point. But for self-serve, there's only one point to pick. There's no customization available. So of course, it depends on what we talk with many customers. We have to pick one point. And I think the end result, like AA published, later on AA published a quality measure. And we actually looked really good. So that's why what I mean is, I will leave the evaluation of quality or performance to third party and work with them to find the most fair benchmark. And I think that's a good approach, a methodology. But I'm not a part of an approach of calling out specific namesSwyx [00:48:55]: and critique other competitors in a very biased way. Databases happens as well. I think you're the more politically correct one. And then Dima is the more... Something like this. It's you on Twitter.Lin [00:49:11]: It's like the Russian... We partner. We play different roles.Swyx [00:49:20]: Another one that I wanted to... I'm just the last one on the competition side. There's a perception of price wars in hosting open source models. And we talked about the competitiveness in the market. Do you aim to make margin on open source models? Oh, absolutely, yes.Lin [00:49:38]: So, but I think it really... When we think about pricing, it's really need to coordinate with the value we're delivering. If the value is limited, or there are a lot of people delivering the same value, there's no differentiation. There's only one way to go. It's going down. So through competition. If I take a big step back, there is pricing from... We're more compared with close model providers, APIs, right? The close model provider, their cost structure is even more interesting because we don't bear any training costs. And we focus on inference optimization, and that's kind of where we continue to add a lot of product value. So that's how we think about product. But for the close source API provider, model provider, they bear a lot of training costs. And they need to amortize the training costs into the inference. So that created very interesting dynamics of, yeah, if we match pricing there, and I think how they are going to make money is very, very interesting.Swyx [00:50:37]: So for listeners, opening eyes 2024, $4 billion in revenue, $3 billion in compute training, $2 billion in compute inference, $1 billion in research compute amortization, and $700 million in salaries. So that is like...Swyx [00:50:59]: I mean, a lot of R&D.Lin [00:51:01]: Yeah, so I think matter is basically like, make it zero. So that's a very, very interesting dynamics we're operating within. But coming back to inference, so we are, again, as I mentioned, our product is, we are a platform. We're not just a single model as a service provider as many other inference providers, like they're providing a single model. We have our optimizer to highly customize towards your inference workload. We have a compound AI system where significantly simplify your interaction to high quality and low latency, low cost. So those are all very different from other providers.Alessio [00:51:38]: What do people not know about the work that you do? I guess like people are like, okay, Fireworks, you run model very quickly. You have the function model. Is there any kind of like underrated part of Fireworks that more people should try?Lin [00:51:51]: Yeah, actually, one user post on x.com, he mentioned, oh, actually, Fireworks can allow me to upload the LoRa adapter to the service model at the same cost and use it at same cost. Nobody has provided that. That's because we have a very special, like we rolled out multi-LoRa last year, actually. And we actually have this function for a long time. And many people has been using it, but it's not well known that, oh, if you find your model, you don't need to use on demand. If you find your model is LoRa, you can upload your LoRa adapter and we deploy it as if it's a new model. And then you use, you get your endpoint and you can use that directly, but at the same cost as the base model. So I'm happy that user is marketing it for us. He discovered that feature, but we have that for last year. So I think to feedback to me is, we have a lot of very, very good features, as Sean just mentioned. I'm the advisor to the company,Swyx [00:52:57]: and I didn't know that you had speculative decoding released.Lin [00:53:02]: We have prompt catching way back last year also. We have many, yeah. So I think that is one of the underrated feature. And if they're developers, you are using our self-serve platform, please try it out.Swyx [00:53:16]: The LoRa thing is interesting because I think you also, the reason people add additional costs to it, it's not because they feel like charging people. Normally in normal LoRa serving setups, there is a cost to dedicating, loading those weights and dedicating a machine to that inference. How come you can't avoid it?Lin [00:53:36]: Yeah, so this is kind of our technique called multi-LoRa. So we basically have many LoRa adapters share the same base model. And basically we significantly reduce the memory footprint of serving. And the one base model can sustain a hundred to a thousand LoRa adapters. And then basically all these different LoRa adapters can share the same, like direct the same traffic to the same base model where base model is dominating the cost. So that's how we advertise that way. And that's how we can manage the tokens per dollar, million token pricing, the same as base model.Swyx [00:54:13]: Awesome. Is there anything that you think you want to request from the community or you're looking for model-wise or tooling-wise that you think like someone should be working on in this?Lin [00:54:23]: Yeah, so we really want to get a lot of feedback from the application developers who are starting to build on JNN or on the already adopted or starting about thinking about new use cases and so on to try out Fireworks first. And let us know what works out really well for you and what is your wishlist and what sucks, right? So what is not working out for you and we would like to continue to improve. And for our new product launches, typically we want to launch to a small group of people. Usually we launch on our Discord first to have a set of people use that first. So please join our Discord channel. We have a lot of communication going on there. Again, you can also give us feedback. We'll have a starting office hour for you to directly talk with our DevRel and engineers to exchange more long notes.Alessio [00:55:17]: And you're hiring across the board?Lin [00:55:18]: We're hiring across the board. We're hiring front-end engineers, infrastructure cloud, infrastructure engineers, back-end system optimization engineers, applied researchers, like researchers who have done post-training, who have done a lot of fine-tuning and so on.Swyx [00:55:34]: That's it. Thank you. Thanks for having us. Get full access to Latent Space at www.latent.space/subscribe
    --------  
    58:25

More Technology podcasts

About Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

The podcast by and for AI Engineers! In 2023, over 1 million visitors came to Latent Space to hear about news, papers and interviews in Software 3.0. We cover Foundation Models changing every domain in Code Generation, Multimodality, AI Agents, GPU Infra and more, directly from the founders, builders, and thinkers involved in pushing the cutting edge. Striving to give you both the definitive take on the Current Thing down to the first introduction to the tech you'll be using in the next 3 months! We break news and exclusive interviews from OpenAI, tiny (George Hotz), Databricks/MosaicML (Jon Frankle), Modular (Chris Lattner), Answer.ai (Jeremy Howard), et al. Full show notes always on https://latent.space www.latent.space
Podcast website

Listen to Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0, Acquired and many other podcasts from around the world with the radio.net app

Get the free radio.net app

  • Stations and podcasts to bookmark
  • Stream via Wi-Fi or Bluetooth
  • Supports Carplay & Android Auto
  • Many other app features
Social
v7.0.1 | © 2007-2024 radio.de GmbH
Generated: 12/17/2024 - 3:54:38 AM