ChatGPT maker OpenAI has announced its next major product release: A generative AI model code-named Strawberry, officially called OpenAI o1.
To be more precise, o1 is actually a collection of models. Two are available today in ChatGPT and via OpenAI’s API: o1-preview and o1-mini, a smaller, cheaper model aimed at code generation. You’ll have to be subscribed to ChatGPT Plus or Team to see them in the ChatGPT client; enterprise and educational users will get access early next week.
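For developers with API access, the new models are queried through the same chat-completions interface as OpenAI’s earlier models. A minimal sketch in Python, assuming the official openai client library and an API key in the environment, might look like the following; note that parameter support for the o1 models may differ from GPT-4o, so treat this as illustrative rather than definitive:

```python
from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment.
client = OpenAI()

# Ask o1-preview to reason through a multi-step question.
response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {
            "role": "user",
            "content": (
                "A train leaves at 9:05 and arrives at 11:47. "
                "How long is the journey in minutes?"
            ),
        }
    ],
)

print(response.choices[0].message.content)
```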
Note that the o1 chatbot experience is fairly barebones at present; unlike ChatGPT, o1 can’t browse the web or analyze files yet. (o1 does have image-analyzing capabilities, but they’ve been disabled pending additional benchmarking.) It’s rate-limited — weekly limits are currently 30 messages for o1-preview and 50 for o1-mini. And the o1 models are expensive. In the API, o1-preview is $15 per 1 million input tokens (3x the cost of GPT-4o) and $60 per 1 million output tokens (4x GPT-4o). (1 million tokens is equivalent to around 750,000 words.)
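To put those prices in concrete terms, here is a quick back-of-the-envelope cost estimate in Python, using the per-million-token rates above; the token counts in the example are hypothetical, chosen only for illustration:

```python
# Published o1-preview API rates (USD per 1 million tokens).
INPUT_RATE = 15.00   # roughly 3x GPT-4o's input price
OUTPUT_RATE = 60.00  # roughly 4x GPT-4o's output price

def o1_preview_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a single o1-preview request in dollars."""
    return (input_tokens / 1_000_000) * INPUT_RATE + \
           (output_tokens / 1_000_000) * OUTPUT_RATE

# Hypothetical request: a 2,000-token prompt and a 1,500-token answer.
# (The model's hidden reasoning tokens are also billed as output,
# so real-world costs will typically run higher than this.)
print(f"${o1_preview_cost(2_000, 1_500):.4f}")  # -> $0.1200
```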
OpenAI says it plans to bring o1-mini access to all free users of ChatGPT, but hasn’t set a release date. We’ll hold the company to it.
o1 avoids some of the reasoning pitfalls that normally trip up generative AI models, at least according to OpenAI. That’s because o1 can effectively fact-check itself by spending more time considering all parts of a command or question.
OpenAI says that o1, originating from an internal company project known as Q*, is particularly adept at solving math and programming-related challenges. But what makes the text-only o1 “feel” qualitatively different from other generative AI models is its ability to “think” before responding to queries.
When given additional time to “think,” o1 can reason through a task holistically — planning ahead and performing a series of actions over an extended period of time that help it arrive at answers. This makes o1 well-suited for tasks that require synthesizing the results of multiple subtasks, like detecting privileged emails in an attorney’s inbox or brainstorming a product marketing strategy.
“o1 is trained with reinforcement learning,” which teaches the system through rewards and penalties, “to ‘think’ before responding via a private chain of thought,” said Noam Brown, a research scientist at OpenAI, in a series of posts on X. He added that OpenAI used a new optimization algorithm and a training data set containing “reasoning data” and scientific literature specifically tailored for the models.
“The longer [o1] thinks, the better it does on reasoning tasks,” Brown said.
TechCrunch wasn’t offered the opportunity to test o1 before its debut; we aim to get our hands on it as soon as possible. But according to a person who did have access — Pablo Arredondo, VP at Thomson Reuters — o1 is better than OpenAI’s previous models (e.g. GPT-4o) at things like analyzing legal briefs and identifying solutions to problems in LSAT logic games.
“We saw it tackling more substantive, multi-faceted, analysis,” Arredondo told TechCrunch. “Our automated testing also showed gains against a wide range of simple tasks.”
In a qualifying exam for the International Mathematical Olympiad, a math competition for high school students, o1 correctly solved 83% of problems, while GPT-4o solved only 13%, OpenAI claims. The company also claims that the model reached the 89th percentile of participants in Codeforces, an online programming competition.
In general, o1 should perform better on problems in data analysis, science and coding, OpenAI says.
Now, there is a downside. o1 can be slower than other models, depending on the query; Arredondo tells us the model can take over ten seconds to answer some questions. (Helpfully, the chatbot version of o1 shows its progress by displaying a label for the subtask it’s currently performing.)
Given the unpredictable nature of generative AI models, o1 likely has other flaws and limitations (Brown admitted that o1 also trips up on games of tic-tac-toe, for example). In a technical paper OpenAI released, the company also says that it’s received anecdotal feedback that o1-preview and o1-mini tend to hallucinate — i.e. confidently make stuff up — more than GPT-4o and GPT-4o-mini, and that o1 less often admits when it doesn’t know the answer to a question.
We’ll no doubt learn about the various issues in time — and once we get a chance to test the model ourselves.
Interestingly, OpenAI decided against showing users o1’s raw “chains of thought,” opting instead for “a model-generated summary” of them. Why? In a blog post, the company says “competitive advantage” was one reason.
“We acknowledge this decision has disadvantages,” OpenAI writes. “We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.”
We’d be remiss if we didn’t point out that OpenAI is far from the only AI vendor investigating these types of reasoning methods to improve model factuality. Google DeepMind researchers recently published a study showing that, by essentially giving models more compute time and guidance to fulfill requests as they’re made, the performance of those models can be significantly improved without any additional tweaks.
OpenAI might be first out of the gate with o1. But assuming rivals soon follow suit with comparable models, the company’s real test will be making o1 widely available.
From there, we’ll see how quickly OpenAI can deliver improved versions of o1. The company says it aims to experiment with o1 models that reason for hours, days or even weeks to further boost their reasoning skills.
Source: TechCrunch