Does OpenAI’s new model deliver on the hype? Inside GPT-5 with Jeremy Kahn
OpenAI has released the latest iteration of its flagship product, GPT-5. It arrives after much anticipation, but does it live up to the hype? Jeremy Kahn, AI Editor at Fortune, returns to the show to break down what this new model can do. We’ll explore how it performs compared to past iterations, what OpenAI decided to improve and why, and what GPT-5 means for the AI giant’s future.
About Jeremy
- Fortune AI Editor; leads Eye on AI newsletter and Brainstorm AI conferences
- Award-winning journalist covering AI and emerging tech for Fortune
- Author of Mastering AI, a broad guide to AI's impact on business & society
- Former managing editor of The New Republic
- Bylines in NYT, The Atlantic, Bloomberg, Newsweek, Slate, Smithsonian
Table of Contents:
- What feels new in GPT-5 right away
- How the model likely works behind the scenes
- Separating real progress from the hype cycle
- Why coding gains matter more than benchmarks alone
- What lower hallucination rates actually mean in practice
- How OpenAI is using pricing and liquidity to stay competitive
- The tension between engagement and mental health safeguards
- What user behavior reveals about ChatGPT's strongest use cases
- Why AGI claims still outpace the science
- Episode Takeaways
Transcript:
Does OpenAI’s new model deliver on the hype? Inside GPT-5 with Jeremy Kahn
RANA EL KALIOUBY: For weeks now we’ve been hearing about OpenAI’s newest AI model, and finally, on Thursday, OpenAI released the much-anticipated GPT-5. It’s a big deal. So big that we’re dropping this bonus episode in your feeds to talk about it. So today on the podcast, we’re giving you a GPT-5 update. How revolutionary is it? What’s different about it, and what does it mean for the future of OpenAI?
EL KALIOUBY: I’m Rana el Kaliouby and this is Pioneers of AI – a podcast taking you behind-the-scenes of the AI revolution. And here to help us unpack all things GPT-5 is Fortune’s AI editor Jeremy Kahn. Thank you so much for joining us again on the show. And I know you’re on vacation, so we really appreciate you making time for us.
[THEME MUSIC]
JEREMY KAHN: Ah, thanks for having me on again, Rana.
What feels new in GPT-5 right away
EL KALIOUBY: All right, so right off the bat, what are your first impressions of GPT-5?
KAHN: Yeah, I think it’s a very good model. There were people who were expecting this would be the AGI kind of moment, and I don’t think it’s quite that. It’s a very good model. It definitely shows some improvement over what was available before. It’s a leap forward, but not maybe a massive leap forward.
EL KALIOUBY: So for our listeners who have not yet had a chance to play with it, is there anything different or noteworthy about the interface?
KAHN: Yeah. So, okay. This is the first time OpenAI has had a model where you don’t have to pick whether you want to use reasoning capabilities or get a faster response drawn simply from the pre-training data. Before, the interface required you to select which model you wanted to answer your query.
Now you can just ask the question. The model itself decides how to answer that question, whether it should use reasoning, whether it shouldn’t use reasoning, how long it should think, quote unquote, about the answer. And all that wasn’t the case before.
So part of the way GPT-5 works is as a kind of router system. It’s actually a family of models, and it’s gonna determine which of those models to send your prompt to based on the content of the prompt. This is something that some of the competing model providers had already built into their systems.
So this was true with Gemini 2.5 Pro from Google, and it was also true with the Claude 4 series from Anthropic. Those models also had this ability to decide, based on your query, how to answer that prompt. This was not true before with OpenAI’s ChatGPT, and now it is. So that’s the biggest difference in the interface.
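The router idea described above can be sketched in a few lines of Python. To be clear, everything here is invented for illustration — the tier names and the keyword heuristics are hypothetical, since OpenAI hasn’t published GPT-5’s actual routing logic:

```python
def route_prompt(prompt: str) -> str:
    """Pick a model tier based on crude features of the prompt."""
    text = prompt.lower()
    # Cues suggesting multi-step reasoning is worth the extra latency.
    reasoning_cues = ("prove", "step by step", "debug", "plan", "why does")
    if any(cue in text for cue in reasoning_cues) or len(prompt) > 500:
        return "family-reasoning"  # larger model, thinks longer
    coding_cues = ("code", "function", "regex", "sql")
    if any(cue in text for cue in coding_cues):
        return "family-coding"     # fine-tuned on curated code data
    return "family-fast"           # small, low-latency default

print(route_prompt("Prove that the sum of two even numbers is even"))  # family-reasoning
print(route_prompt("Write a regex for US zip codes"))                  # family-coding
print(route_prompt("What's the capital of France?"))                   # family-fast
```

In a real system the dispatcher would itself be a learned classifier rather than keyword matching, but the shape is the same: one cheap decision up front, then the prompt goes to exactly one model in the family.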
How the model likely works behind the scenes
EL KALIOUBY: Do you have a sense of how different the training process was? Like, did they train on just more data, or different data? What’s different on the backend?
KAHN: Yeah, I think we don’t entirely know. They have said that it is a family of models of various sizes. Some of them are very fast, and usually if it’s very fast, those models are fairly lightweight. They’re fairly small, and they’ve been trained on highly curated data.
And it may be that for certain tasks, for instance coding or math, they’ve created very fine-tuned models that will handle those particular queries. And then maybe for some other types of queries, if it’s a complicated logic problem where you do need a lot of reasoning, they may have used models that are quite a bit larger and take a lot more time to think. We don’t know exactly how they trained it. We know it performs particularly well on coding, so they almost certainly gave it highly curated data sets on coding. We know it does slightly better at some English writing tasks.
So they may have also created some refined data sets to evaluate and kind of post-train the model on those tasks. So there’s clearly something they did in terms of training that’s different. What exactly that is, they’re not telling us.
EL KALIOUBY: Yeah, black box for now. Is there anything different about the personality or the tone?
KAHN: Yeah. So one of the things that they’ve done is actually allow you to select four different personality types. Particularly this is the case if you use the voice mode, and you can get it to answer in different ways. One of them is, I think they call it the Cynic, which is gonna be more skeptical.
It’s gonna be more sarcastic in how it answers. One is called the Listener. It’s more empathetic. There’s one called the Robot, which they thought might be better for — what OpenAI said is maybe it’s better for business use cases where you just want a very concise answer, very factual. And you can adjust the tone based on these four different personality types. That’s another difference in how the interface works with GPT five.
Separating real progress from the hype cycle
EL KALIOUBY: Yeah. There’s been a lot of hype about GPT-5 and a lot of anticipation. Some critics are saying the hype is basically a marketing ploy to drive engagement. What do you think?
KAHN: Yeah, like I said, I think it is a very good model, but is it orders of magnitude different than anything we’ve seen from competitors? And the answer is no. So to some extent there is a bit of hype here. And I think it’s interesting because you had someone like Sam Altman saying, oh, this is the first model where it really feels like you’re talking with a PhD level researcher on any topic.
And I don’t know, in my experimentation with it so far, I’d say yeah, it’s good, but do I really feel like I’m talking to a PhD level researcher on every topic? I’m not so sure.
One of the analogies that Altman used in the press conference where they announced this model was to say it was a bit like when iPhones started having retina display cameras instead of the old sort of pixelated cameras.
And that was a big leap forward in photography. But it wasn’t as big a leap forward, I’d say, as going from the pre-iPhone era to the iPhone era. And I think some people were expecting, given the hype around this, that this would be a similar kind of leap, like going from the pre-smartphone era to having smartphones.
And it’s clearly not that. This is an upgrade in features. It’s a difference of degree, but not a difference of kind.
Why coding gains matter more than benchmarks alone
EL KALIOUBY: Yeah. I love that — difference in degree, but not a difference of kind. Okay, so let’s unpack how it’s doing better on some of these benchmarks. And I wanna start with the model’s coding capabilities. How much better is it? And also what does that mean for some of these platforms like Cursor and Windsurf and Lovable that are basically packaging some of these coding capabilities and have done really well as companies.
KAHN: Yeah. So the coding capabilities are definitely better. They are about eight to 10% better on a lot of coding benchmarks than the previous o3 model from OpenAI. They did not provide benchmarking against competing models from other providers. But we’re starting to see some of those evaluations come out from third parties, and it looks like it’s pretty good, but maybe still not quite as good as Claude 4 at coding.
So it’s sort of in the range. It’s definitely better than what OpenAI had available before. They have one benchmark called SWE-bench, which stands for software engineering. According to OpenAI’s own evaluations, on that it is the best model out there.
It gets 75% on that benchmark, which is very good. It is about 8% better than anything else out there on those questions. But again, in some third-party evaluations, people are saying they don’t prefer the answers as much as what Claude 4 has produced. What this means for those companies like Cursor that are kind of wrappers around various models: for them it’s great. They just get to move these capabilities into their system. It may require, as every new model release does for these companies, some work on the backend themselves, because they use a lot of meta prompting strategies, where there are prompts taking place in the background that you may not be aware of as a user.
And every time these models are upgraded, the way those prompts are answered changes slightly. So I’m sure they have a little bit of work to do, but in general, it’s good for them because it just gives them a more capable system to play around with. It should be good for the users of all of those kind of wrapper products.
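The "meta prompting" mentioned above can be made concrete with a small sketch. A wrapper product (a coding assistant, say) silently surrounds the user's request with its own hidden instructions and context before calling the model. The prompt text and function name here are hypothetical, not any real product's internals:

```python
HIDDEN_SYSTEM_PROMPT = (
    "You are a coding assistant embedded in an editor. "
    "Return only a unified diff. Do not explain unless asked."
)

def build_messages(user_request: str, file_context: str) -> list:
    """Assemble the message list actually sent to the model API."""
    return [
        {"role": "system", "content": HIDDEN_SYSTEM_PROMPT},
        # Context the user never typed: the file currently being edited.
        {"role": "user", "content": "Current file:\n" + file_context},
        {"role": "user", "content": user_request},
    ]

msgs = build_messages("rename x to total", "x = 1\nprint(x)")
print(len(msgs))  # 3: two hidden messages plus the user's own request
```

When the underlying model is swapped out, those hidden instructions may be followed slightly differently, which is exactly why wrapper companies retest and retune their background prompts after every release.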
What lower hallucination rates actually mean in practice
EL KALIOUBY: Yeah. Okay. So let’s talk about the lower hallucination rates. Is there any data to back this up?
KAHN: Yeah, so they again did evaluations of the models on hallucination rates; that’s in the system card for GPT-5. But again, it’s compared to their own previous models. It’s not compared in the system card to any competing models from different providers. Now, according to their data, it is significantly better in terms of factual accuracy. But what you should note is that GPT-5 in the main mode still has a hallucination rate of about 10%. And even in what they call the thinking mode, where the model uses its reasoning abilities and takes a long time to provide responses (which in general produces more factually accurate results), they’ve only gotten the hallucination rate down to about 4.8 or 4.9%. That still means about one out of every 20 times you’re gonna get an inaccuracy. So it does seem, from some of the third-party evaluations and benchmark comparisons that have been done, that it’s about on par with Gemini 2.5 Pro in terms of hallucination rate, and again maybe slightly better than Claude 4, but it’s close. So yeah, that’s what we know so far.
EL KALIOUBY: Yeah. How do you think this affects OpenAI’s business against some of its competitors? It sounds to me like they’re still neck and neck.
KAHN: Yeah, I think they’re very close. Over the last basically six months, it seemed like they had actually kind of lost their pole position. In a lot of things, Claude was better. And then Gemini 2.5 Pro was a very good model that came out on top on a lot of benchmarks.
So I think this allows them to claim that they’re back in the lead by a little bit. They caught back up and pulled maybe slightly ahead, but they’re not ahead by miles on a lot of this. But I think what it does is, if you were a heavy user of ChatGPT and you had a pro subscription, they’ve basically given you a reason not to switch to somebody else if you were considering abandoning it and thinking, oh, Anthropic’s Claude is so much better now, or Gemini’s really good now, maybe I should consider dropping my OpenAI subscription. They’ve basically given you a reason not to do that.
And meanwhile, the other thing to note about OpenAI is they’re very focused on the consumer business, and they’ve rolled GPT-5 out to everyone, including the people who just use the free version of the service. And I think a lot of those people, who may not have even played around with the reasoning models before because they weren’t available to free users prior to this, may be kind of blown away by how much better this is compared to what they had before.
And it may bring in even more consumer users. They’ve now said they have about 700 million active weekly users. That’s a big number. And they may gain even more users now that this model is available.
So I think in terms of their consumer business, this is good. Also in terms of their enterprise business, for the people who use their API, they’ve priced this pretty aggressively. They have matched the pricing that Google offers for Gemini 2.5, and they have significantly undercut what Anthropic charges for Claude Opus 4.1.
So I think they may find that a lot of businesses are gonna switch to using them, or swap them in, and it may be very good for their enterprise business.
How OpenAI is using pricing and liquidity to stay competitive
EL KALIOUBY: Yeah, while we’re at it. OpenAI is allegedly selling shares held by current and former employees at a price that values the company at $500 billion. And one theory is that a little liquidity will stop staff from basically jumping ship to Meta or X or Anthropic. What do you think of that?
KAHN: Yeah, I’m sure that’s exactly what they’re doing. I’m sure that’s the reason for the secondary share sale. They’ve been losing staff to Meta. Meta has been on this huge campaign of poaching staff from other AI companies for this new super intelligence unit that Meta has set up.
And they’ve literally been offering hundred-million-dollar signing bonuses, and in some cases even more than that, to lure top researchers. There were even some reports that they were offering one particular researcher Meta shares that might’ve been worth a billion dollars.
I mean, it’s incredible amounts of money. And we know that when they hired Alexandr Wang from Scale, they basically did this deal where they made a huge investment in Scale for $14.7 billion. So they’re spending tons of money to poach talent. And OpenAI has been trying to counter those offers to keep staff.
And I think they’ve been struggling a little bit. And this secondary share sale would definitely give a reason for a lot of their employees to stick around.
The tension between engagement and mental health safeguards
EL KALIOUBY: Yeah. It’s an incredible time to be an AI talent for sure. So in other news around OpenAI, they appear to be acknowledging that they need to address the issue of a lot of people using ChatGPT for mental health support. They’ve set up a mental health advisory board.
They’ve also implemented some new guardrails to discourage users from viewing and using the chatbot as a therapist. Can you talk more about that? I think this is really important, not just for the raw OpenAI models, but for a lot of the companies that sit on top of them and are addressing mental health and addiction or suicide prevention, et cetera.
KAHN: Yeah, it does seem like it’s one of the primary use cases for these chatbots — people using them essentially as kind of therapists, in ways that users often feel are improving their mental health. But there have been some documented cases of people whose mental health has suffered from conversations with these chatbots, particularly going down rabbit holes on conspiracy theories and those sorts of things.
So I think they’re very much a double-edged sword. OpenAI keeps saying, look, we didn’t train this to be a therapist. It’s not a doctor. It shouldn’t necessarily be used in that way. But I think users are doing it anyway.
So they’ve done some things to try to provide disclaimers to users. But they haven’t made the models refuse outright to provide therapy-like advice. They have tried to make that advice safer. So the model should not tell you that it’s okay to self-harm or anything like that.
And if you show suicidal thoughts, they now have the model recommend that you seek outside help. So these are good things. I think some people would like them to go further. But it’s a fine line between being an empathetic listener and acting as a kind of friendly companion, versus acting as a therapist. And I think they’re finding it hard to straddle that divide. I think also they don’t necessarily want to, because it’s such a popular use case. I think actually they don’t want to discourage users too much from doing this.
I think actually they’re quite happy with it. It makes the model very engaging in a lot of ways. People who are using it as a therapist are not likely to abandon using the model. So I don’t know. I think they’re a little bit two-faced sometimes on this stuff, when it comes to the issue of whether people should be using chatbots as kind of mental wellness tools.
EL KALIOUBY: Yeah. What’s interesting is they also tout GPT-5 as being much better on health-related questions. I certainly use ChatGPT to answer health-related questions. I upload my medical records, like my blood tests and whatnot. What are they claiming around that?
KAHN: Yeah, this is sort of an example of what I was alluding to. On the one hand, they’re very eager to say this is not a doctor, it doesn’t have medical training, it certainly is not supposed to be a therapist. And yet they know that people are using it in this way, and they’re now touting very explicitly how good it is at answering these health questions. So I feel like it’s a little bit of doublespeak there on some of this. They keep saying, look, there’s no patient-doctor confidentiality here, and their backend is not HIPAA compliant.
But anyway, getting back to what OpenAI said about the health performance of this model.
They did evaluations on how it does with health questions, and they’re saying it does much better than any previous model out there. They have claimed it’s better than competing models at essentially making diagnoses. Which is interesting. Again, this is an area where they know users like to use chatbots in this way, find it very helpful. And they’re very much leaning into this as a kind of consumer-facing product, and this is part of that.
What user behavior reveals about ChatGPT's strongest use cases
EL KALIOUBY: There’s been this image that’s been circulating on social media. It’s basically a graph showing the number of tokens processed by OpenAI, which is of course a proxy for usage. Have you seen this, and what’s your take? It basically shows a lot of usage, and then on June 6th it really drops. What do you think?
KAHN: Yeah, I’ve seen this. So yeah, a lot of people have used this to say, aha, the people who are really using ChatGPT are students, both in high school and at the university level. And it dropped then because that’s when most schools were letting out.
And kids are just using these models rampantly, and that’s the primary use case of ChatGPT. I mean, to the extent that we know the token usage graph is accurate.
I do think students are a big user group. It is true that students are using these models a lot for schoolwork now. Some of that’s cheating, and some of that may actually be quite legitimate use cases. One of the things OpenAI did the week before releasing GPT-5 was roll out this new study mode for ChatGPT, where you can specifically ask it to take on the role of a tutor, and it won’t give away the answers directly. It kind of leads you through Socratic questioning towards the answer yourself. It can literally create a whole curriculum for you and walk you through topics. It’s actually quite powerful.
My Fortune colleague, Sharon Goldman, wrote a column about this. She said she’d been terrible at high school algebra and had always been really ashamed of it. And then with study mode, she literally had it create an algebra curriculum for her, and it walked her through it. And I think this is like one of the great use cases of these things. Things like study mode are fantastic.
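A rough sketch of how a Socratic tutoring mode like the one described could be layered on top of a chat model is just a system prompt that forbids giving the answer away. The wording below is invented for illustration; OpenAI hasn’t published study mode’s actual instructions:

```python
def study_mode_system_prompt(subject: str) -> str:
    """Build a hypothetical Socratic-tutor system prompt for a subject."""
    return (
        f"You are a patient {subject} tutor. Never state the final answer. "
        "Ask one guiding question at a time, check the student's reasoning "
        "at each step, and only confirm answers the student reaches "
        "on their own."
    )

# This string would be sent as the hidden "system" message, with the
# student's questions following as ordinary user messages.
print(study_mode_system_prompt("algebra"))
```

The interesting design point is that nothing about the underlying model changes; the same weights behave like a tutor or like an answer machine depending entirely on the instructions framing the conversation.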
Why AGI claims still outpace the science
EL KALIOUBY: Yeah. So I wanna come back to something Sam Altman said during a press briefing about GPT-5. He basically said this is a significant step along the path towards artificial general intelligence. Why do you think he said that?
KAHN: Oh, well, they’ve been hyping this model for a long time. And we know that their goal as a company is to explicitly achieve AGI and make sure that its benefits accrue widely to all of humanity. So saying that’s a significant step towards AGI, I think is a way of saying, look, we’re still on this mission. We’re still in the leadership pole position in this race to get to AGI. And I think this whole thing about AGI, it’s a very undefined term, as you know.
I don’t think anybody really knows what that means anymore. And we don’t have a good benchmark for AGI. So it’s unclear how close we really are. Are we closer than we were yesterday? Yeah, probably a little bit, but how much is very hard to tell.
EL KALIOUBY: Yeah. Altman said — and I love this — it is not a model that continuously learns as it’s deployed from new things it finds. Which is something that to me feels like it should be part of AGI. I totally agree with that. For some of our listeners who may not be familiar with how this works, they’re basically training a new model, deploying it, and then this model is in production. It’s not updating itself on the go. And I do feel like that ought to be part of AGI.
KAHN: Yeah. No, I totally agree. Continuous learning, I think, would definitely need to be part of AGI. And something about learning efficiency too is not really taken into account. Humans can learn from very few examples. It’s not totally clear that this is what happens with these models. So I do think there are several bits of this that are still not there.
EL KALIOUBY: All righty. So to close us out — is GPT-5 revolutionary, and what does it tell us about where AI is heading?
KAHN: I would say GPT-5 is, again, a good model, but it’s not revolutionary. It’s very much an evolutionary step in the development of these models. It is not some great discontinuous leap forward.
It also tells you how inadequate the evaluations are, because a lot of it comes down to — and I think OpenAI even said this — this model having good vibes.
I mean, that is what a lot of people say. It feels better than the other models. And there is something to that, but it’s also a really unscientific way to evaluate things.
But look, OpenAI still has the largest user base of any of these competing products. And putting this capability in the hands of so many people is significant.
EL KALIOUBY: Amazing. Well, thank you Jeremy for joining us. This was so helpful.
KAHN: Yeah. Thanks for having me, Rana.
Episode Takeaways
- Fortune AI editor Jeremy Kahn says GPT-5 is a meaningful upgrade for OpenAI, but not the AGI moment some hoped for—more leap forward than true revolution.
- One big shift is usability: GPT-5 routes prompts across a family of models and lets users choose different personalities, making ChatGPT feel more seamless and adaptive.
- On coding and accuracy, GPT-5 appears stronger than OpenAI’s prior models, though Jeremy notes it remains roughly neck-and-neck with Claude and Gemini rather than miles ahead.
- That may still be enough to help OpenAI defend its lead, especially as it rolls GPT-5 out to free users, prices it aggressively for enterprises, and tries to retain top talent.
- The episode also highlights OpenAI’s contradictions around health and therapy use, and lands on a clear verdict: GPT-5 feels like an important evolutionary step, not a transformative break.