Reposted from mydataguest.substack.com
Have you ever wondered how ChatGPT can communicate in so many languages? Officially, ChatGPT supports 58 languages. But according to Ilias Ilm in a January 2025 article, the real number is closer to 95+ languages, including programming languages.
Naturally, ChatGPT is most proficient in English, since the majority of its training data is in English. Still, it performs impressively well in many other languages, including some that are quite rare. So how is this possible? The answer comes down to three key factors that make large language models (LLMs) surprisingly language-flexible at relatively low cost:
- Multilingual training data. The pre-training corpus contains text from multiple languages, mainly English but also a wide variety of others. This exposure helps the model learn patterns across languages.
- Emergent behavior. LLMs are trained with a simple objective: predict the next word. Yet during training they develop unexpected abilities, such as translation and cross-lingual reasoning, that were never explicitly programmed.
- Subword-based tokenization. Instead of relying on grammar rules for each language, LLMs use subword tokenization (often with the Byte Pair Encoding algorithm). This method breaks text into common character or byte sequences, making the approach largely language-agnostic.
Training Corpus
The simplest answer to the question "How can ChatGPT be proficient in so many languages?" is that its training set includes text written in multiple languages.
OpenAI has not released a detailed description of the neural architecture or the training data used for GPT-4 (and by extension ChatGPT, which is based on GPT-4). To gain some insight, we can look instead at the training data used for GPT-3.
According to the papers describing GPT-2 and GPT-3, the training sets for both models were drawn largely from English-dominated sources. The distribution of these sources for GPT-3 is summarized in Table 1.

Table 1: Composition of the GPT-3 training mix (Brown et al., 2020).

| Dataset | Tokens (billions) | Weight in training mix |
| --- | --- | --- |
| Common Crawl (filtered) | 410 | 60% |
| WebText2 | 19 | 22% |
| Books1 | 12 | 8% |
| Books2 | 55 | 8% |
| Wikipedia | 3 | 3% |
It's reasonable to assume that some texts from the Reddit dataset were written in languages other than English. In fact, some online summaries estimate that about 93% of documents were in English and 7% in other languages. While 7% may sound small, 7% of billions of tokens still represents a substantial volume of multilingual text.
This has important implications:
- No separate sub-models for each language. GPT-3 doesn't contain dedicated language modules; instead, it learns to handle multiple languages implicitly from patterns in the data.
- Shared linguistic patterns. The model learns grammar, semantics, and even code across languages by generalizing from the data it has seen.
Interestingly, about 22% of GPT-3's training mix came from WebText2, a dataset of web pages linked from Reddit posts (2015–2020). Given Reddit's reputation during that era, it's somewhat surprising that ChatGPT developed such a polite conversational style.
But this raises a crucial question: if the model trained on so many languages, how large must its vocabulary be? English alone has around 600,000 words. Adding words from other languages would seem to make the vocabulary size unmanageable. So how is it possible to train such a large model to handle so many languages efficiently?
Emergent Behavior
Another factor that may explain ChatGPT's surprising multilingual abilities, despite the relatively small share of non-English data in its pre-training, is emergent behavior.
Emergent behavior refers to unexpected properties that arise in complex systems made of simpler parts. It's a concept long used in biology and physics, and now increasingly in AI. Such behaviors are not explicitly programmed; instead, they emerge from the collective dynamics of the system.
This phenomenon is still an active area of research. On the one hand, emergent behaviors offer powerful and unexpected benefits. On the other, they remain only partially understood, raising new questions about the predictability of large language models.
Subword-based Tokenization
Considering that emergent behavior is still not fully understood and that the training corpus contained only a limited amount of non-English text, it is useful to look at another key factor: the tokenization process, and whether it works equally well across languages.
There are three main approaches to tokenization:
- Word-based tokenization: Splits text into words, treating each word as a token. While simple, this approach quickly becomes impractical. A single language can have hundreds of thousands of words, which already produces a very large vocabulary. Combining multiple languages makes the vocabulary size unmanageably large.
- Character-based tokenization: Splits text into individual characters. This keeps the vocabulary very small (roughly 200 symbols for Latin scripts, including punctuation and spaces), but it makes input sequences extremely long and inefficient for training.
- Subword-based tokenization: A middle ground that balances vocabulary size with sequence length. This is the method used in GPT models, implemented via the Byte Pair Encoding (BPE) algorithm.
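To make the trade-off concrete, here is a small comparison, a sketch assuming the tiktoken library is installed; it contrasts a naive word split, a character split, and the subword encoding used by GPT models:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/4 subword vocabulary

text = "Tokenization makes multilingual models practical"
print("Words:     ", len(text.split()))     # word-based: short sequence, huge vocabulary
print("Characters:", len(list(text)))       # character-based: tiny vocabulary, long sequence
print("Subwords:  ", len(enc.encode(text))) # subword-based: a balance of the two
```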
How Byte Pair Encoding (BPE) Works
BPE is a frequency-based compression algorithm:
- It begins with the smallest possible units: characters (or bytes in newer implementations).
- At the start, the text is represented as a sequence of single-character tokens.
- The algorithm repeatedly merges the most frequent pair of adjacent tokens into a new token, replacing the old pair.
- This process continues until a target vocabulary size is reached.
The result is a set of subword tokens: frequent character sequences, word fragments, or even entire words.
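To make the merge loop tangible, here is a minimal, simplified BPE trainer in Python. It is a sketch of the core algorithm, not OpenAI's production implementation, and the tiny corpus and merge count are made up for demonstration:

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    # Start from single-character tokens for every word.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a single new token.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

# Tiny made-up corpus for demonstration; frequent pairs like ('e', 's')
# and ('s', 't') tend to be merged first.
print(bpe_train(["low", "low", "lower", "newest", "newest", "widest"], 5))
```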
OpenAI's tiktoken Tokenizer
In OpenAI's newer models, the tiktoken tokenizer starts directly from bytes rather than characters. This avoids issues with non-Latin scripts and UTF-8 encoding. Its process can be summarized as follows:
- Convert the input string into UTF-8 bytes.
- Treat each byte as an initial token.
- Iteratively merge the most frequent token pairs into new tokens.
- Build a final vocabulary of subword units, each mapped to a unique token ID.
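You can observe the byte-level behavior directly: tiktoken's encoding objects expose decode_single_token_bytes, which maps each token ID back to its raw UTF-8 bytes. A short sketch (the exact split depends on the vocabulary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Greek for "hello": a non-Latin script, handled via raw UTF-8 bytes.
for token_id in enc.encode("γεια"):
    print(token_id, enc.decode_single_token_bytes(token_id))
```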
Why This Matters
This frequency-based approach provides several advantages:
- Efficient vocabulary size: Keeps the dictionary manageable while still covering many languages.
- Cross-linguistic coverage: Captures common roots, stems, and grammatical endings not only in English but also in other languages.
- No out-of-vocabulary problem: Any new or rare word can still be represented as a sequence of known subwords or, in the worst case, characters.
In short, subword-based tokenization via BPE (and tiktoken) is what enables ChatGPT to handle such a wide variety of languages efficiently, even with relatively little direct training data in some of them.
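The no-OOV property is easy to verify: feed the tokenizer a word that almost certainly never appeared in any training data. A quick sketch with an invented word:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# An invented word: the tokenizer still encodes it from known subwords,
# falling back to bytes in the worst case. It never fails.
tokens = enc.encode("flurbanization")
print([enc.decode([t]) for t in tokens])
```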

Tokenization Examples
To test how ChatGPT's tokenization handles non-English text, I tried it on an unusual Italian sentence.
Sentence: "Domani andiamo alla sagra del cinghiale fritto"
Translation: "Tomorrow we are going to the fried boar festival" (Yes, these festivals really exist where I come from!)
The word "cinghiale" (boar) is fairly uncommon and likely never appeared in the training set. In a worst-case scenario, the model could still process it by splitting it into single characters: "c", "i", "n", "g", "h", "i", "a", "l", "e". However, we are lucky enough to have a few subwords that fit it. Using byte-level BPE OpenAI tiktoken tokenizer, the word "cinghiale" is split into:
- "c"
- "ing" (a frequent subword in the vocabulary)
- "h"
- "iale" (less frequent subword, but common enough to be in the vocabulary)
And the full sentence tokenizes like this:
| Token ID | Token |
| --- | --- |
| 23658 | 'Dom' |
| 5676 | 'ani' |
| 323 | ' and' |
| 35314 | 'iamo' |
| 23536 | ' alla' |
| 274 | ' s' |
| 12944 | 'agra' |
| 1624 | ' del' |
| 272 | ' c' |
| 287 | 'ing' |
| 71 | 'h' |
| 20487 | 'iale' |
| 282 | ' f' |
| 1018 | 'rit' |
| 998 | 'to' |
Notice that:
- "andiamo" (tr: we go) gets split into "and" (a common English token) and "iamo" (a common Italian verb ending).
- "cinghiale" (tr: boar) gets split into "c", "ing", "h", and "iale", as we have seen before.
- Subwords from English (like "to", "ing", "and") are effectively recycled for Italian words.
This sentence results in only 15 tokens, which is:
- much shorter than character-based tokenization would produce and
- comparable to word-based tokenization, but without the Out Of Vocabulary (OOV) problem.
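You can check these counts yourself; a quick sketch comparing the three measures for this sentence:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
s = "Domani andiamo alla sagra del cinghiale fritto"

print("Characters:", len(s))              # what character-based tokenization would produce
print("Words:     ", len(s.split()))      # what word-based tokenization would produce
print("BPE tokens:", len(enc.encode(s)))  # 15, matching the table above
```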
Example 2
Sentence: " Spero che ritorni presto l'era del cighiale bianco"
Translation: " I hope that the age of the white boar will come back soon" (from an Italian song lyrics)
| Token ID | Token |
| --- | --- |
| 50 | 'S' |
| 716 | 'per' |
| 78 | 'o' |
| 3091 | ' che' |
| 21889 | ' rit' |
| 1540 | 'orn' |
| 72 | 'i' |
| 23095 | 'prest' |
| 78 | 'o' |
| 326 | ' l' |
| 6 | "'" |
| 2473 | 'era' |
| 1624 | ' del' |
| 272 | ' c' |
| 287 | 'ing' |
| 71 | 'h' |
| 20487 | 'iale' |
| 97589 | 'bian' |
| 1030 | 'co' |
Notice that:
- The tokenizer recognizes entire words like 'era' and ' che'.
- Common subwords like 'ing', 'per', and 'co' are reused effectively.
- The whole sentence is split into 19 tokens, still efficient and linguistically reasonable.
Try it yourself
Using Python
You can experiment with tokenization directly in Python using the "cl100k_base" encoding via the tiktoken library. This is the tokenizer used by GPT-3.5 and GPT-4 (newer models such as GPT-4o use the related "o200k_base" encoding). Here is the Python code to run BPE tokenization with tiktoken and the cl100k_base encoding.
```python
import subprocess
import sys

try:
    import tiktoken
except ImportError:
    # Install tiktoken on the fly if it is missing.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "tiktoken"])
    import tiktoken

# Load the GPT tokenizer (used by GPT-3.5/4)
enc = tiktoken.get_encoding("cl100k_base")

# Example sentences
sentences = [
    "Spero che ritorni presto l'era del cinghiale bianco",
]

for text in sentences:
    tokens = enc.encode(text)
    decoded = [enc.decode([t]) for t in tokens]
    print("\nSentence:", text)
    print("Tokens:", decoded)
    print("Token IDs:", tokens)
    print("Total tokens:", len(tokens))
```
Using the TikTokenizer App
If you'd rather not write code, you can explore tokenization interactively with the TikTokenizer app. Simply type a sentence on the left, select a tokenizer model on the right, and the app will display the tokenization results below. It's a quick way to see how different tokenizers break down text.

How ChatGPT speaks so many languages
We began with a simple question: "How can ChatGPT speak so many languages?"
Three possible explanations emerged:
- The training dataset. One hypothesis is the scale of the training data. Even though only about 7% of the GPT-3 corpus was non-English, 7% of 300 billion tokens is still roughly 20 billion tokens, a huge amount. Could that alone account for ChatGPT's multilingual fluency? Possibly.
- Emergent behavior. Another explanation is emergent behavior, a still poorly understood byproduct of large-scale models. Because of their size and complexity, LLMs develop capabilities they weren't explicitly trained for. While their core objective is predicting the next word, they also perform surprisingly well on tasks such as step-by-step reasoning, planning, and translation.
- Tokenization. A third, and perhaps the most convincing, explanation lies in tokenization. Modern LLMs don't rely on word-based or character-based tokenization but on subword units. This method is frequency-based rather than grammar-based, making it largely language-agnostic. Many subwords in the vocabulary are derived from English, yet because they are small and flexible, they map well to other languages too.
Among these three hypotheses, tokenization appears to be the most concrete and testable explanation. To test it, we ran tokenization experiments on Italian sentences using the cl100k_base tokenizer. The results were encouraging: although many subwords clearly come from English, they still provide a solid foundation for handling Italian text effectively.
You can try similar experiments in your own language. We've provided both Python code and a link to the TikTokenizer app, so you can explore how your sentences are broken down into tokens.
Through this exploration, we've taken a closer look at one of the hidden mechanisms that makes large language models so powerful. We hope this deep dive into tokenization has been clear, useful, and interesting for anyone curious about the secret tricks behind LLMs' multilingual abilities.
Originally published at https://mydataguest.substack.com.