Technology · May 17, 2024

GPT-4o’s Chinese token-training data is polluted by spam and porn websites

Soon after OpenAI released GPT-4o, the newest version of its large language model chatbot, on Monday, May 13, some Chinese speakers started to notice that something seemed off: the tokens it uses to parse text were full of spam and porn phrases.

On May 14, Tianle Cai, a Ph.D. student at Princeton University studying inference efficiency in LLMs, accessed GPT-4o’s public token library and pulled a list of the 100 longest Chinese tokens the model uses to parse and compress Chinese prompts.

Humans read in words, but LLMs read in tokens, which are distinct units in a sentence that have consistent and significant meanings. These can be dictionary words, but they also include suffixes, common expressions, names, and more. The more tokens a model encodes, the fewer pieces it has to split a sentence into, so it can “read” the sentence faster and with less computing power, which makes the response cheaper.

Of the 100 results, only three are common enough to be used in everyday conversations; everything else consists of words and expressions used specifically in gambling or pornography contexts. The longest token, 10.5 Chinese characters long, literally means “_free Japanese porn video to watch.” Oops.

“This is sort of ridiculous,” Cai wrote, and posted the list of tokens on GitHub.

OpenAI did not respond to questions sent by MIT Technology Review prior to publication.

The release of GPT-4o is supposed to improve the chatbot’s ability to handle multilingual tasks. In particular, this is achieved through a new tokenization tool that better compresses text in non-English languages.

But at least when it comes to the Chinese language, the new tokenizer used by GPT-4o has introduced a disproportionate number of meaningless phrases. Experts say it’s likely due to insufficient data cleaning and filtering before training the tokenizer. 

And there are consequences for GPT-4o’s performance. Because these tokens are not commonly spoken words or phrases, the chatbot can fail to grasp their meanings. Researchers have been able to leverage these tokens to trick GPT-4o into hallucinating answers, or even into circumventing its safety guardrails.

Why non-English tokens matter

The easiest way for a model to process text is character by character, but that is obviously more time-consuming and laborious than if the model can understand that a certain string of characters always means the same thing, like “c-r-y-p-t-o-c-u-r-r-e-n-c-y” always means cryptocurrency. These strings of characters are encoded as “tokens” that the model uses to process prompts. Because of that, including more and longer tokens usually means LLMs are more efficient and more affordable for users, who are often billed per token.
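To make the effect of longer tokens concrete, here is a minimal sketch in Python. It is an illustration only: it uses a simplified greedy longest-match rule and a toy vocabulary, both of which are assumptions made for this example, and it is not how GPT-4o’s tokenizer actually works.

```python
def tokenize(text, vocab):
    """Greedy longest-match: at each position, take the longest entry
    found in vocab, falling back to a single character."""
    max_len = max((len(entry) for entry in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


text = "cryptocurrency"

# Toy vocabularies (hypothetical): one with no multi-character tokens,
# one that contains the whole word as a single token.
print(len(tokenize(text, set())))               # 14: one token per character
print(len(tokenize(text, {"cryptocurrency"})))  # 1
```

If a user is billed per token, the second vocabulary makes the same prompt 14 times cheaper to read in this toy case.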

When OpenAI released GPT-4o on May 13, it also released a new tokenizer to replace the one it used in previous versions, GPT-3.5 and GPT-4. The new tokenizer especially adds support for non-English languages, according to OpenAI’s website.

The new tokenizer has 200,000 tokens in total, and about 25% of the tokens are in non-English languages, says Deedy Das, an AI investor at Menlo Ventures. He used language filters to count the number of tokens in different languages, and the top languages, besides English, are Russian, Arabic, and Vietnamese.
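The article does not describe the filters Das used. As a rough sketch of how such a count could be done, the Python below groups tokens by the writing system of their letters, using the script name that begins each character’s Unicode name. The vocabulary here is a toy list made up for the example, and the approach has an obvious limit: it cannot tell apart languages that share a script, such as English and Vietnamese.

```python
import unicodedata
from collections import Counter


def dominant_script(token):
    """Return the most common script among a token's letters,
    e.g. 'LATIN', 'CYRILLIC' or 'CJK'; 'OTHER' if it has no letters."""
    scripts = Counter()
    for ch in token:
        if ch.isalpha():
            # Unicode names start with the script: "CYRILLIC SMALL LETTER PE".
            scripts[unicodedata.name(ch, "UNKNOWN").split(" ")[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else "OTHER"


def script_shares(vocab):
    """Fraction of tokens in vocab that fall under each script."""
    if not vocab:
        return {}
    counts = Counter(dominant_script(token) for token in vocab)
    total = sum(counts.values())
    return {script: n / total for script, n in counts.items()}


# Toy vocabulary, invented for illustration only.
toy_vocab = ["the", " university", "привет", "مرحبا", "大学", "123"]
for script, share in script_shares(toy_vocab).items():
    print(script, round(share, 2))
```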

“So the tokenizer’s main impact, in my opinion, is you get the cost down in these languages, not that the quality in these languages goes dramatically up,” Das says. When an LLM has better and longer tokens in non-English languages, it can analyze prompts faster, and users are charged less for the same answer. With the new tokenizer, “you’re looking at almost four times cost reduction,” he says.

Das, who also speaks Hindi and Bengali, took a look at the longest tokens in those languages. The tokens reflect the conversations happening in those languages, so they include words like “Narendra” or “Pakistan.” But other than those, the list looks similar to a list of common long words in English, like “prime minister,” “university,” and “international.” They also don’t exhibit the issue seen in the Chinese tokens.

That likely reflects the training data in those languages, Das says: “My working theory is the websites in Hindi and Bengali are very rudimentary. It’s like [mostly] news articles. So I would expect this to be the case. There are not many spam bots and porn websites trying to happen in these languages. It’s mostly going to be in English.”

Polluted data and a lack of cleaning

However, things are drastically different in Chinese. According to multiple researchers who have looked into the new library of tokens used for GPT-4o, the longest tokens in Chinese are almost exclusively spam words used in pornography, gambling, and scamming contexts. Even shorter tokens, like three-character-long Chinese words, are heavily concentrated on the same topics.

“The problem is clear: the corpus used to train [the tokenizer] is not clean. The English tokens seem fine, but the Chinese ones are not,” says Cai from Princeton University. Crawling spam and including it in training data is not rare, but usually, there will be significant effort taken to clean up the data before it’s used. “It’s possible that they didn’t do proper data clearing when it comes to Chinese,” he says.

The content of these Chinese tokens could suggest that they have been polluted by a specific phenomenon: websites hijacking unrelated content in Chinese or other languages to boost spam messages. 

These messages are often advertisements for pornographic videos and gambling websites. They could come from real businesses or be merely scams. The language is inserted into content farm websites, or sometimes legitimate websites, so that it can be indexed by search engines, circumvent spam filters, and be found in random searches. For example, Google indexed one search result page on a US National Institutes of Health website, which lists a porn site in Chinese. The same site name also appeared in at least five Chinese tokens in GPT-4o.

Chinese users have reported that these spam sites appeared frequently in unrelated Google search results this year, including in comments made to Google Search’s support community. It’s likely that these websites also found their way into OpenAI’s training database for GPT-4o’s new tokenizer. 

The same issue didn’t exist with the previous-generation tokenizer and the Chinese tokens used for GPT-3.5 and GPT-4, says Zhengyang Geng, a computer science Ph.D. researcher at Carnegie Mellon University. There, the longest Chinese tokens are common words like “life cycles” or “auto-generation.”

Das, who worked on the Google Search team for three years, says the prevalence of spam content is a known problem and isn’t that hard to fix. “Every spam problem has a solution. And you don’t need to cover everything in one technique,” Das says. Even simple solutions like requesting an automatic translation of the content when detecting certain keywords could “get you 60% of the way there,” he says.

But OpenAI likely didn’t clean the Chinese dataset or the tokens before the release of GPT-4o, Das says: “At the end of the day, I just don’t think they did the work in this case.”

It’s unclear whether any other languages are impacted. One X user reported that a similar prevalence of porn and gambling content is present in Korean tokens.

The tokens can be used to jailbreak

Users have also found that these tokens can be used to break the LLM, either getting it to spew out completely unrelated answers or, in rare cases, generate answers that are not allowed per OpenAI’s safety standards.

Geng from Carnegie Mellon University asked GPT-4o to translate some of the long Chinese tokens into English. The model then proceeded to translate words that were never included in the prompts, a typical result of LLM hallucinations.

He also succeeded in using the same tokens to “jailbreak” GPT-4o, a common term for using certain expressions or methods to get a model to generate things it shouldn’t. “It’s pretty easy to use these [rarely-used] tokens to induce undefined behaviors from the models,” Geng says. “I did some personal red-teaming experiments … The simplest example is asking it to make a bomb. In a normal condition, it would decline it, but if you first use these rare words to ‘jailbreak’ it, then it will start following your orders. Once it starts to follow your orders, you can ask it all kinds of questions.”

In his tests, which Geng chooses not to share with the public, he says he can see GPT-4o generating the answers line by line. But when it almost reaches the end, another safety mechanism kicks in and blocks the content from being shown to the user as it detects unsafe content.

The phenomenon is not unusual in LLMs, says Sander Land, a machine learning engineer at Cohere, a Canadian AI company. Land and his colleague Max Bartolo recently drafted a paper on how to detect the unusual tokens that can be used to cause models to glitch. One of the most famous examples was “SolidGoldMagikarp,” a Reddit user name that was found to get ChatGPT to generate unrelated, weird, and unsafe answers.

The problem is that the tokenizer and the actual LLM are sometimes trained on different datasets, and what was prevalent in the tokenizer dataset is, for whatever reason, not prevalent in the LLM dataset. The result is that while the tokenizer picks up certain words that it sees frequently, the model is not sufficiently trained on them and never fully understands what these “under-trained” tokens mean. In the SolidGoldMagikarp case, the username was likely included in the tokenizer training data but not in the actual GPT training data, leaving the model at a loss about what to do with the token. “And if it has to say something … it gets kind of a random signal and can do really strange things,” Land says.
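The mismatch Land describes can be illustrated with a short Python sketch. This is not the detection method from Land and Bartolo’s paper, which the article does not detail; it simply assumes you can count how often each token appears in the model’s training data and flags vocabulary entries that fall below a threshold. The function name, the threshold, and the toy data are all hypothetical.

```python
from collections import Counter


def undertrained_tokens(vocab, model_corpus_tokens, min_count=1):
    """Return vocabulary entries seen fewer than min_count times in the
    (already tokenized) data the model itself was trained on."""
    seen = Counter(model_corpus_tokens)
    # Counter returns 0 for tokens that never appear in the model's data.
    return sorted(token for token in vocab if seen[token] < min_count)


# Toy data: the tokenizer learned a username that the model never saw.
vocab = {"the", "cat", "SolidGoldMagikarp"}
model_corpus_tokens = ["the", "cat", "the"]

print(undertrained_tokens(vocab, model_corpus_tokens))
# ['SolidGoldMagikarp']
```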

And different models could glitch differently in this situation. “Like LLaMA 3 always gives back empty space but sometimes then talks about the empty space as if there was something there. With other models, I think Gemini, when you give it one of these tokens, it provides a beautiful essay about aluminum, and [the question] didn’t have anything to do with aluminum,” says Land.

To solve this, the dataset used for training the tokenizer should closely represent the dataset used for the LLM, he says, so there won’t be mismatches between them. If the actual model’s data has gone through safety filters to clean out porn or spam content, the same filters should be applied to the tokenizer data. In reality, this is sometimes hard to do: training an LLM takes months and involves constant improvement, with spam content being filtered out along the way, while token training is usually done at an early stage and could remain unfiltered.
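In code, the principle amounts to routing both datasets through one shared filter. The Python sketch below assumes a simple keyword blocklist; the keywords, function names, and documents are placeholders invented for the example, and real data cleaning would need far more than substring matching.

```python
# Hypothetical blocklist; a real pipeline would use much richer signals.
BLOCKED_KEYWORDS = ("free-casino-bonus", "adult-video-stream")


def is_clean(doc, blocked=BLOCKED_KEYWORDS):
    """True if the document contains none of the blocked keywords."""
    lowered = doc.lower()
    return not any(keyword in lowered for keyword in blocked)


def filter_corpus(docs, blocked=BLOCKED_KEYWORDS):
    """Keep only the documents that pass the shared filter."""
    return [doc for doc in docs if is_clean(doc, blocked)]


raw_docs = [
    "A news report about a university.",
    "Click here: FREE-CASINO-BONUS today!",
]

# The same filter feeds both stages, so the tokenizer cannot learn
# tokens from documents the model will never be trained on.
tokenizer_data = filter_corpus(raw_docs)
model_data = filter_corpus(raw_docs)
print(tokenizer_data == model_data)  # True
```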

While experts agree it’s not too difficult to solve the issue, it could get complicated as the result gets looped into multi-step intra-model processes, or when the polluted tokens and models get inherited in future iterations. For example, it’s not possible to publicly test GPT-4o’s video and audio functions yet, and it’s unclear whether they suffer from the same glitches that can be caused by these Chinese tokens.

“The robustness of visual input is worse than text input in multimodal models,” says Geng, whose research focuses on visual models. Filtering a text dataset is relatively easy, but filtering visual elements will be harder. “The same issue with these Chinese spam tokens could become bigger with visual tokens.”
