How many times does the letter R appear in the word “strawberry”? According to formidable AI products like GPT-4o and Claude, the answer is twice.
Large language models can write essays and solve equations in seconds. They can synthesize terabytes of data faster than humans can open up a book. Yet these seemingly omniscient AIs sometimes fail so spectacularly that the mishap turns into a viral meme, and we all rejoice in relief that maybe there’s still time before we must bow down to our new AI overlords.
The failure of large language models to understand the concepts of letters and syllables is indicative of a larger truth that we often forget: These things don’t have brains. They do not think like we do. They are not human, nor even particularly humanlike.
Most LLMs are built on transformers, a kind of deep learning architecture. Transformer models break text into tokens, which can be full words, syllables, or letters, depending on the model.
“LLMs are based on this transformer architecture, which notably is not actually reading text. What happens when you input a prompt is that it’s translated into an encoding,” Matthew Guzdial, an AI researcher and assistant professor at the University of Alberta, told TechCrunch. “When it sees the word ‘the,’ it has this one encoding of what ‘the’ means, but it does not know about ‘T,’ ‘H,’ ‘E.’”
This is because the transformers are not able to take in or output actual text efficiently. Instead, the text is converted into numerical representations of itself, which are then contextualized to help the AI come up with a logical response. In other words, the AI might know that the tokens “straw” and “berry” make up “strawberry,” but it may not understand that “strawberry” is composed of the letters “s,” “t,” “r,” “a,” “w,” “b,” “e,” “r,” “r,” and “y,” in that specific order. Thus, it cannot tell you how many letters, let alone how many “r”s, appear in the word “strawberry.”
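To make the idea concrete, here is a minimal sketch in Python. The vocabulary, the token IDs and the `toy_tokenize` helper are all invented for illustration. Real tokenizers use learned vocabularies with many thousands of entries, and they may split “strawberry” differently.

```python
# Toy illustration only: the vocabulary and IDs below are made up,
# not taken from any real model's tokenizer.
TOY_VOCAB = {"straw": 1001, "berry": 1002}

def toy_tokenize(text, vocab):
    """Greedy longest-match split of text into known pieces, returned as IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shorter ones.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i:]!r}")
    return ids

word = "strawberry"
token_ids = toy_tokenize(word, TOY_VOCAB)

# What the model receives: two opaque integers, with no letters in them.
print(token_ids)        # [1001, 1002]

# What ordinary code sees when it works on the characters directly.
print(word.count("r"))  # 3
```

Counting letters is trivial once the characters are visible. In this toy setup the model's input never contains them, only the two integer IDs.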
This isn’t an easy issue to fix, since it’s embedded into the very architecture that makes these LLMs work.
TechCrunch’s Kyle Wiggers dug into this problem last month and spoke to Sheridan Feucht, a PhD student at Northeastern University studying LLM interpretability.
“It’s kind of hard to get around the question of what exactly a ‘word’ should be for a language model, and even if we got human experts to agree on a perfect token vocabulary, models would probably still find it useful to ‘chunk’ things even further,” Feucht told TechCrunch. “My guess would be that there’s no such thing as a perfect tokenizer due to this kind of fuzziness.”
This problem becomes even more complex as an LLM learns more languages. For example, some tokenization methods might assume that a space in a sentence will always precede a new word, but many languages like Chinese, Japanese, Thai, Lao, Korean, Khmer and others do not use spaces to separate words. Google DeepMind AI researcher Yennie Jun found in a 2023 study that some languages need up to ten times as many tokens as English to communicate the same meaning.
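The space assumption is easy to see in a few lines of Python. This is a deliberately naive sketch, not how any production tokenizer works, and the `whitespace_tokenize` helper and both example sentences are ours.

```python
# Naive "tokenizer" that assumes words are always separated by spaces.
def whitespace_tokenize(sentence):
    return sentence.split()

english = "I like strawberries"
chinese = "我喜欢草莓"  # roughly the same meaning, written without spaces

print(whitespace_tokenize(english))  # ['I', 'like', 'strawberries']
print(whitespace_tokenize(chinese))  # ['我喜欢草莓']
```

The English sentence splits into three words, while the Chinese one comes back as a single unbroken chunk, so a tokenizer built on this assumption has to fall back on some other way of segmenting it.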
“It’s probably best to let models look at characters directly without imposing tokenization, but right now that’s just computationally infeasible for transformers,” Feucht said.
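One rough way to see the cost: in a standard transformer, self-attention compares every position in the input with every other position, so that part of the work grows with the square of the sequence length. The sketch below rests on that back-of-the-envelope assumption alone. The two-token split of “strawberry” is hypothetical, and the `attention_pairs` helper ignores everything else that affects real-world cost.

```python
def attention_pairs(sequence_length):
    # Every position attends to every position, itself included.
    return sequence_length ** 2

word = "strawberry"
as_tokens = 2              # hypothetical split: "straw" + "berry"
as_characters = len(word)  # 10 positions, one per letter

print(attention_pairs(as_tokens))      # 4
print(attention_pairs(as_characters))  # 100
```

Under this simplification, reading one word letter by letter costs 25 times as many comparisons as reading it as two tokens, and the gap widens as the text gets longer.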
Image generators like Midjourney and DALL-E don’t use the transformer architecture that lies beneath the hood of text generators like ChatGPT. Instead, image generators usually use diffusion models, which reconstruct an image from noise. Diffusion models are trained on large databases of images, and they’re incentivized to try to recreate something like what they learned from training data.
Asmelash Teka Hadgu, co-founder of Lesan and a fellow at the DAIR Institute, told TechCrunch, “Image generators tend to perform much better on artifacts like cars and people’s faces, and less so on smaller things like fingers and handwriting.”
This could be because these smaller details don’t often appear as prominently in training sets as concepts like how trees usually have green leaves. The problems with diffusion models might be easier to fix than the ones plaguing transformers, though. Some image generators have improved at representing hands, for example, by training on more images of real, human hands.
“Even just last year, all these models were really bad at fingers, and that’s exactly the same problem as text,” Guzdial explained. “They’re getting really good at it locally, so if you look at a hand with six or seven fingers on it, you could say, ‘Oh wow, that looks like a finger.’ Similarly, with the generated text, you could say, that looks like an ‘H,’ and that looks like a ‘P,’ but they’re really bad at structuring these whole things together.”
That’s why, if you ask an AI image generator to create a menu for a Mexican restaurant, you might get normal items like “Tacos,” but you’ll be more likely to find offerings like “Tamilos,” “Enchidaa” and “Burhiltos.”
As these memes about spelling “strawberry” spill across the internet, OpenAI is working on a new AI product code-named Strawberry, which is supposed to be even more adept at reasoning. The growth of LLMs has been limited by the fact that there simply isn’t enough training data in the world to make products like ChatGPT more accurate. But Strawberry can reportedly generate accurate synthetic data to make OpenAI’s LLMs even better. According to The Information, Strawberry can solve the New York Times’ Connections word puzzles, which require creative thinking and pattern recognition to solve, and can solve math equations that it hasn’t seen before.
Meanwhile, Google DeepMind recently unveiled AlphaProof and AlphaGeometry 2, AI systems designed for formal math reasoning. Google says these two systems solved four out of six problems from the International Math Olympiad, which would be a good enough performance to earn a silver medal at the prestigious competition.
It’s a bit of a troll that memes about AI being unable to spell “strawberry” are circulating at the same time as reports on OpenAI’s Strawberry. But OpenAI CEO Sam Altman jumped at the opportunity to show us that he’s got a pretty impressive berry yield in his garden.
Source: TechCrunch