
What is Tokenization in NLP (Natural Language Processing)?
Natural Language Processing uses both linguistics and mathematics to connect the languages of humans with the language of computers. Natural language usually comes in one of two forms, text or speech. Through NLP algorithms, these natural forms of communication are broken down into data that can be understood by a machine.
There are many complications working with natural language, especially with humans who aren’t accustomed to tailoring their speech for algorithms. Although there are rules for speech and written text that we can create programs out of, humans don’t always adhere to these rules. The study of the official and unofficial rules of language is called linguistics. NLP methods must also be adapted to handle different languages, each with unique scripts and linguistic structures.
The issue with using formal linguistics to create NLP models is that the rules for any language are complex. The rules of language alone often pose problems when converted into formal mathematical rules. Although linguistic rules work well to define how an ideal person would speak in an ideal world, human language is full of shortcuts, inconsistencies, and errors.
Because of the limitations of formal linguistics, computational linguistics has become a growing field. Using large datasets, linguists can discover more about how human language works and use those findings to inform natural language processing tasks. These models are trained on large text corpora, which provide the foundational data for learning language patterns.
This version of NLP, statistical NLP, has come to dominate the field of natural language processing. Using statistics derived from large amounts of training data, statistical NLP bridges the gap between how language is supposed to be used and how it is actually used.
How does Tokenization Work in Natural Language Processing?
In NLP, tokenization is a simple process that takes raw text (unprocessed input) and converts it into a useful data string. While tokenization is well known for its use in cybersecurity and in the creation of NFTs, tokenization in NLP has a distinct meaning from tokenization in payments or cryptography. In NLP, tokenization is used to split paragraphs and sentences into smaller units that can be more easily assigned meaning.
The first step of the NLP process is gathering the input text (for example, a sentence) and breaking it into understandable parts (words). The input text is fed into a tokenizer, which applies a tokenization algorithm to segment the raw text into tokens that can then be used by language models and other downstream analysis.
NLP Tokenization Example
Here’s an example of a string of data:
“What restaurants are nearby?”
In order for this sentence to be understood by a machine, tokenization is performed on the string to break it into individual parts. With tokenization, we’d get something like this:
‘what’ ‘restaurants’ ‘are’ ‘nearby’
Depending on the tokenization method, certain words or punctuation marks might be treated as a single token.
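To make this concrete, here is a minimal sketch of word tokenization in Python; the lowercasing and the regex used here are illustrative choices, not the rules any particular NLP library applies.

```python
import re

def simple_word_tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(simple_word_tokenize("What restaurants are nearby?"))
# ['what', 'restaurants', 'are', 'nearby']
```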
This may seem simple, but breaking a sentence into its parts allows a machine to understand the parts as well as the whole. This will help the program understand each of the words by themselves, as well as how they function in the larger text. This is especially important for larger amounts of text as it allows the machine to count the frequencies of certain words as well as where they frequently appear. This is important for later steps in natural language processing.
Why is Tokenization Important in NLP?
Tokenization is a foundational step in Natural Language Processing (NLP) because it transforms unstructured text into a format that machines can understand and work with. As a fundamental NLP method, tokenization prepares text for a variety of NLP tasks by breaking it into smaller units (words, subwords, or characters) called tokens.
Building a vocabulary of unique tokens during this process directly impacts model efficiency and performance. The resulting tokens then serve as the basis for further linguistic and text analysis and as input to downstream models.
Kinds of Tokenization
There are several different methods that are used to separate words to tokenize them, and these methods will fundamentally change later steps of the NLP process. Some tokenization methods operate at the word level, splitting text into individual words.
Word Tokenization
Word tokenization is the most common version of tokenization. It takes natural breaks, like pauses in speech or spaces in text, and splits the data into its respective words using delimiters (characters like spaces, commas, or semicolons). While this is the simplest way to separate speech or text into its parts, it does come with some drawbacks.
It’s difficult for word tokenization to separate unknown words or Out Of Vocabulary (OOV) words. Word tokenization can also struggle with rare words that are not present in the model's vocabulary, making it less effective for handling infrequent or unseen terms. This is often solved by replacing unknown words with a simple token that communicates that a word is unknown. This is a rough solution, especially since 5 ‘unknown’ word tokens could be 5 completely different unknown words or could all be the exact same word.
Word tokenization’s accuracy is based on the vocabulary it is trained with. These models have to find the balance between loading words for maximum accuracy and maximum efficiency. While adding an entire dictionary’s worth of vocabulary would make an NLP model more accurate, it’s often not the most efficient method. This is especially true for models that are being trained for a more niche purpose.
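The "unknown token" workaround described above can be sketched in a few lines; the toy vocabulary and the `<unk>` symbol are illustrative, and real systems differ in how they choose and apply the unknown token.

```python
def map_to_vocab(tokens: list[str], vocab: set[str], unk: str = "<unk>") -> list[str]:
    """Replace any token that is not in the vocabulary with a single unknown symbol."""
    return [tok if tok in vocab else unk for tok in tokens]

vocab = {"what", "restaurants", "are", "nearby"}
print(map_to_vocab(["what", "bistros", "are", "nearby"], vocab))
# ['what', '<unk>', 'are', 'nearby'] -- the model can no longer tell which word was unknown
```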
Character Tokenization
Character tokenization was created to address some of the issues that come with word tokenization. Instead of breaking text into words, it splits text into individual characters, providing a more granular approach to tokenization. This allows the tokenization process to retain information about Out-of-vocabulary (OOV) words that word tokenization cannot.
Character tokenization doesn’t have the same vocabulary issues as word tokenization as the size of the ‘vocabulary’ is only as many characters as the language needs. For English, for example, a character tokenization vocabulary would have about 26 characters.
While character tokenization solves OOV issues, it isn’t without its own complications. By breaking even simple sentences into characters instead of words, the length of the output is increased dramatically. With word tokenization, our previous example “what restaurants are nearby” is broken down into four tokens. By contrast, character tokenization breaks this down into 24 tokens, a 6x increase in tokens to work with.
Character tokenization also adds an additional step of understanding the relationship between the characters and the meaning of the words. Sure, character tokenization can make additional inferences, like the fact that there are 5 “a” tokens in the above sentence. However, this tokenization method moves an additional step away from the purpose of NLP, interpreting meaning.
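A character tokenizer is short enough to sketch directly; whether spaces are kept as tokens is an implementation choice, and dropping them here reproduces the 24-token count from the example above.

```python
def char_tokenize(text: str, keep_spaces: bool = False) -> list[str]:
    """Split text into single-character tokens, optionally keeping spaces."""
    return [ch for ch in text.lower() if keep_spaces or not ch.isspace()]

tokens = char_tokenize("what restaurants are nearby")
print(len(tokens))          # 24 tokens instead of 4 word tokens
print(tokens.count("a"))    # 5 occurrences of the character 'a'
```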
Subword Tokenization
Subword tokenization is similar to word tokenization, but it breaks individual words down a little further using specific linguistic rules. One of the main techniques it relies on is breaking off affixes. Because prefixes, suffixes, and infixes change the inherent meaning of words, they can also help programs understand a word’s function.
This can be especially valuable for out of vocabulary words, as identifying an affix can give a program additional insight into how unknown words function. Byte pair encoding (BPE) is a popular subword tokenization algorithm that merges frequent character or byte sequences to create subword units, making it effective for handling rare words and improving text encoding in models like GPT-2.
The subword model will search for these subwords and break down words that include them into distinct parts. For example, the query “What is the tallest building?” would be broken down into ‘what’ ‘is’ ‘the’ ‘tall’ ‘est’ ‘build’ ‘ing’.
How does this method help the issue of OOV words? Let’s look at an example:
Perhaps a machine receives a more complicated word, like ‘machinating’ (the present participle of the verb ‘machinate’, which means to scheme or engage in plots). It’s unlikely that machinating is a word included in many basic vocabularies.
If the NLP model were using word tokenization, this word would simply be converted into an unknown token. However, if the NLP model were using subword tokenization, it would be able to separate the word into an ‘unknown’ token and an ‘ing’ token. From there it can make valuable inferences about how the word functions in the sentence.
But what information can a machine gather from a single suffix? The common ‘ing’ suffix, for example, functions in a few easily defined ways. It can form a verb into a noun, like the verb ‘build’ turned into the noun ‘building’. It can also form a verb into its present participle, like the verb ‘run’ becoming ‘running.’
If an NLP model is given this information about the ‘ing’ suffix, it can make several valuable inferences about any word that uses the subword ‘ing.’ If ‘ing’ is being used in a word, the model knows that the word is either functioning as a verb turned into a noun or as a present participle. This dramatically narrows down how the unknown word, ‘machinating,’ may be used in a sentence.
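A minimal sketch of the affix idea, assuming a hand-picked suffix list and an `<unk>` placeholder for the unrecognized stem; real subword tokenizers learn their units from data rather than from a rule list like this.

```python
SUFFIXES = ("ing", "est", "ed", "ly")  # illustrative, hand-picked suffixes

def split_suffix(word: str, vocab: set[str], unk: str = "<unk>") -> list[str]:
    """Return the word if it is known; otherwise try to peel off a known suffix."""
    if word in vocab:
        return [word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return [unk, suffix]  # unknown stem plus an informative suffix token
    return [unk]

vocab = {"what", "is", "the", "tall", "build"}
print(split_suffix("machinating", vocab))  # ['<unk>', 'ing']
```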
There are multiple ways that text or speech can be tokenized, although each method’s success relies heavily on the strength of the programming integrated in other parts of the NLP process. Tokenization serves as the first step, taking a complicated data input and transforming it into useful building blocks for the natural language processing program to work with.
As natural language processing continues to evolve using deep learning models, humans and machines are able to communicate more efficiently. This is just one of many ways that tokenization is providing a foundation for revolutionary technological leaps.
The transformation of unstructured text into a structured, machine-interpretable sequence of discrete symbols is the foundational first step in nearly all Natural Language Processing (NLP) pipelines. This process, known as tokenization, has evolved from simple heuristic-based methods to sophisticated statistical and subword models that are integral to the performance of large-scale language models. This article provides a rigorous mathematical treatment of the primary tokenization paradigms: from early whitespace and rule-based methods to the statistical and subword algorithms that underpin contemporary systems. We delve into the formalisms of Byte Pair Encoding (BPE), WordPiece, SentencePiece, and the Unigram Language Model, elucidating their objective functions, learning procedures, and decoding mechanisms through a mathematical lens. The aim is to provide a consolidated reference that bridges the gap between the practical implementation of these algorithms and their underlying theoretical principles.
1. Introduction
Natural Language Processing operates on the premise of representing human language in a computational form. Let a document \(\mathcal{D}\) be a sequence of characters drawn from a finite alphabet \(\mathcal{C}\) (e.g., Unicode code points). The raw character sequence is often intractable for direct modeling due to its high dimensionality and sparsity. Tokenization is the function \(\tau: \mathcal{C}^* \to \mathcal{V}^*\) that maps a string of characters \(\mathbf{s} \in \mathcal{C}^*\) to a sequence of tokens \(\mathbf{t} = (t_1, t_2, ..., t_n)\), where each token \(t_i\) is an element of a finite vocabulary \(\mathcal{V}\).
The choice of \(\tau\) and the composition of \(\mathcal{V}\) are critical. An optimal vocabulary should balance several competing objectives:
- *Coverage*: The ability to represent any string in the language.
- *Efficiency*: A compact vocabulary size to limit model parameters.
- *Generalization*: The capacity to handle unseen words and morphologically complex forms.
- *Semantic Coherence*: Ideally, tokens should correspond to meaningful linguistic units.
This article traces the mathematical evolution of tokenization strategies developed to navigate this trade-off.
1.1 Formal Problem Statement
Given a training corpus \(\mathbb{D} = \{\mathcal{D}_1, \mathcal{D}_2, ..., \mathcal{D}_N\}\), the goal of tokenization learning is to infer a vocabulary \(\mathcal{V}\) and a segmentation function \(\tau\) that optimizes a predefined objective, which is often related to the compressed size of the tokenized corpus or the likelihood of the data under a language model.
2. Preliminary and Heuristic Methods
Before the advent of data-driven methods, tokenization relied on hand-crafted rules.
2.1 Whitespace Tokenization
This is the simplest tokenization function. For a string \(\mathbf{s}\), it is split on whitespace characters.
\[\tau_{\text{ws}}(\mathbf{s}) = \text{split}(\mathbf{s}, \text{whitespace})\]
While computationally trivial, it fails for languages like Chinese that do not use whitespace delimiters and treats punctuation ambiguously (e.g., "word." vs. "word .").
2.2 Rule-based Tokenization (using Regex)
This method uses regular expressions to define more sophisticated splitting rules. Let \(\Sigma\) be the alphabet. A regex pattern \(R\) defines a regular language \(L(R) \subseteq \Sigma^*\). The tokenization function \(\tau_R\) uses \(R\) to match and split the string.
For example, a pattern might be designed to split on whitespace but also to separate punctuation:
`R = (\s+|['".?!,:;])`
The function \(\tau_R\) is deterministic and based purely on the surface form of the characters. Its effectiveness is limited by the complexity and linguistic knowledge required to design a comprehensive rule set \(R\).
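Both \(\tau_{\text{ws}}\) and a simple \(\tau_R\) can be sketched in a few lines of Python; the regex below is an illustrative pattern in the spirit of \(R\), not the exact one above.

```python
import re

def tau_ws(s: str) -> list[str]:
    """Whitespace tokenization: split on runs of whitespace."""
    return s.split()

def tau_regex(s: str) -> list[str]:
    """Rule-based tokenization: keep word characters together, split off punctuation."""
    return re.findall(r"\w+|[^\w\s]", s)

print(tau_ws("Hello, world."))     # ['Hello,', 'world.']   -- punctuation stays attached
print(tau_regex("Hello, world."))  # ['Hello', ',', 'world', '.']
```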
3. Statistical Tokenization
Statistical methods move away from hand-crafted rules and use corpus statistics to determine likely word boundaries, particularly in unsegmented languages.
3.1 Maximum Likelihood and Word Segmentation
The problem can be framed as finding the most probable segmentation \(\mathbf{t}^*\) for a character sequence \(\mathbf{s}\).
\[\mathbf{t}^* = \arg \max_{\mathbf{t} \in \mathcal{S}(\mathbf{s})} P(\mathbf{t})\]
where \(\mathcal{S}(\mathbf{s})\) is the set of all possible segmentations of \(\mathbf{s}\) into contiguous subsequences (words), and \(P\) is a language model over word sequences.
A common approach uses a unigram word model. Let \(f(w)\) be the frequency of a word \(w\) in the corpus. The probability of a word is \(P(w) = f(w) / M\), where \(M\) is the total number of word tokens. The probability of a segmentation \(\mathbf{t} = (t_1, ..., t_n)\) is:
\[P(\mathbf{t}) = \prod_{i=1}^{n} P(t_i)\]
The optimal segmentation is then:
\[\mathbf{t}^* = \arg \max_{\mathbf{t} \in \mathcal{S}(\mathbf{s})} \prod_{i=1}^{n} P(t_i)\]
This can be solved efficiently using the Viterbi algorithm, where the dynamic programming state \(dp[i]\) stores the maximum probability of segmenting the first \(i\) characters.
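A minimal sketch of this Viterbi-style dynamic program, assuming a toy word-probability table; a real system would also smooth probabilities and handle unsegmentable inputs more carefully.

```python
def viterbi_segment(s: str, probs: dict[str, float]) -> list[str]:
    """Most probable segmentation of s under a unigram word model."""
    n = len(s)
    best = [0.0] * (n + 1)   # best[i]: max probability of any segmentation of s[:i]
    best[0] = 1.0
    back = [0] * (n + 1)     # back[i]: start index of the last word in that segmentation
    for i in range(1, n + 1):
        for j in range(i):
            word = s[j:i]
            if word in probs and best[j] * probs[word] > best[i]:
                best[i] = best[j] * probs[word]
                back[i] = j
    tokens, i = [], n        # recover the segmentation by walking the backpointers
    while i > 0:
        tokens.append(s[back[i]:i])
        i = back[i]
    return tokens[::-1]

probs = {"what": 0.3, "restaurants": 0.1, "are": 0.3, "nearby": 0.1, "rest": 0.05}
print(viterbi_segment("whatrestaurantsarenearby", probs))
# ['what', 'restaurants', 'are', 'nearby']
```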
Limitation: This requires a pre-defined vocabulary of words, which is circular. Subword tokenization, discussed later, solves this by building the vocabulary from the data itself.
4. Subword Tokenization: The Core of Modern NLP
Subword tokenization addresses the fundamental limitations of word-level models: the vocabulary size explosion and the inability to handle out-of-vocabulary (OOV) words. The core idea is to break down rare words into frequent subword units while keeping frequent words as single tokens.
Let the final vocabulary \(\mathcal{V}\) be a set of subword units. Any word \(w\) can be segmented into a sequence of subwords \(t_1, t_2, ..., t_k\) such that the concatenation \(\text{concat}(t_1, t_2, ..., t_k) = w\). The different subword algorithms are distinguished by how they learn \(\mathcal{V}\) from the corpus and how they segment words at inference time.
5. Byte Pair Encoding (BPE)
BPE [1] is a compression algorithm repurposed for tokenization. It is a purely frequency-based, greedy, deterministic method.
5.1 Algorithm Formulation
Input: A training corpus \(\mathbb{D}\), a desired vocabulary size \(V\).
Initialization:
- Normalize the corpus (e.g., lowercasing, Unicode normalization).
- Pre-tokenize the corpus into words and append a special end-of-word symbol `</w>` to each word. Let the initial vocabulary be the set of all unique characters in the corpus plus this marker: \(\mathcal{V}_0 = \mathcal{C} \cup \{\text{</w>}\}\).
- Represent the entire corpus as a sequence of these initial symbols.
Iterative Merging:
For \(i = 1\) to \(V - |\mathcal{V}_0|\):
- Find the pair of *adjacent* symbols \((x, y)\) in the current vocabulary representation of the corpus that has the highest frequency.
\[(x, y) = \arg \max_{(x, y) \in \mathcal{P}} \text{count}(x, y)\]
where \(\mathcal{P}\) is the set of all adjacent symbol pairs in the current segmented corpus.
- Create a new symbol \(z = xy\).
- Update the vocabulary: \(\mathcal{V}_i = \mathcal{V}_{i-1} \cup \{z\}\).
- Replace every occurrence of the pair \((x, y)\) in the corpus with the new symbol \(z\).
Output: The final vocabulary \(\mathcal{V} = \mathcal{V}_{V - |\mathcal{V}_0|}\), which contains exactly \(V\) symbols.
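A compact sketch of the merge loop above, operating on a small `{word: frequency}` dictionary instead of a full corpus; the toy data and the `</w>` marker handling are illustrative.

```python
from collections import Counter

def learn_bpe(word_freqs: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules (most frequent adjacent pair first) from word frequencies."""
    # Each word becomes a tuple of symbols ending in the end-of-word marker.
    corpus = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():                 # count adjacent pairs
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)         # highest-frequency pair
        merges.append(best)
        new_corpus = {}
        for symbols, freq in corpus.items():                 # replace the pair everywhere
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] = freq
        corpus = new_corpus
    return merges

merges = learn_bpe({"low": 5, "lower": 2, "lowest": 3, "newest": 4}, num_merges=6)
print(merges)   # the learned merge rules, in the order they were created
```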
5.2 Tokenization (Encoding)
To tokenize a new word \(w\) (with `</w>` appended):
- Represent the word as a sequence of characters.
- Iteratively apply the learned merge operations in the order they were learned until no more merges are possible. The resulting sequence is the tokenization.
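Encoding then replays the learned merges in order; this sketch reuses the (hypothetical) `merges` list produced by the training sketch above.

```python
def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize one word by applying the learned merge rules in the order they were learned."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(bpe_encode("lowest", merges))  # output depends on the learned merge table
```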
Mathematical Properties:
* Greedy Nature: The algorithm makes the locally optimal choice at each step. It does not guarantee a globally optimal vocabulary for any specific objective like compression ratio.
* Determinism: The segmentation is unique and deterministic for a given word and merge table.
* No Explicit Probability Model: BPE is driven entirely by raw frequency counts, not a probabilistic model of the language.
6. WordPiece
WordPiece [2] is the tokenization algorithm used in BERT. It is very similar to BPE in its iterative merging process but uses a different, likelihood-based objective for selecting which pair to merge.
6.1 Algorithm Formulation
Input: A training corpus \(\mathbb{D}\), a desired vocabulary size \(V\). The initialization is identical to BPE.
Iterative Merging:
For \(i = 1\) to \(V - |\mathcal{V}_0|\):
- Train a language model on the current segmented corpus. This is typically a unigram model where the probability of a word \(w\) segmented as \((t_1, ..., t_k)\) is \(P(w) = \prod_{j=1}^k P(t_j)\).
- For every possible pair of *adjacent* symbols \((x, y)\) in the current vocabulary, consider the merge \(z = xy\). Calculate the *score* for this merge, which is the likelihood of the corpus after the merge relative to before. The score is defined as:
\[\text{score}(x, y) = \frac{\text{count}(x, y)}{\text{count}(x) \cdot\text{count}(y)}\]
This formula approximates the mutual information between the two tokens. A high score indicates that \(x\) and \(y\) appear together very frequently relative to their individual frequencies, suggesting they form a coherent unit.
- Merge the pair \((x, y)\) with the highest score.
\[(x, y) = \arg \max_{(x, y) \in \mathcal{P}} \frac{\text{count}(x, y)}{\text{count}(x) \cdot \text{count}(y)}\]
- Update the vocabulary: \(\mathcal{V}_i = \mathcal{V}_{i-1} \cup \{z\}\).
Output: The final vocabulary \(\mathcal{V}\).
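The merge score above is just a ratio of counts; a tiny numeric sketch (with made-up counts) shows why a pair that is frequent in absolute terms can still lose to a pair whose symbols are strongly associated.

```python
def wordpiece_score(pair_count: int, left_count: int, right_count: int) -> float:
    """score(x, y) = count(x, y) / (count(x) * count(y))"""
    return pair_count / (left_count * right_count)

# Hypothetical counts: both candidate pairs co-occur 50 times,
# but the second pair's symbols are individually far more common.
print(wordpiece_score(50, 60, 80))    # ~0.0104   -> strongly associated, merged first
print(wordpiece_score(50, 500, 800))  # ~0.000125 -> frequent symbols, weak association
```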
6.2 Tokenization (Encoding)
Tokenization in WordPiece is formulated as a greedy, longest-match-first (maximum prefix) algorithm. Given a word \(w\):
- Initialize an empty list of tokens.
- While the word is not empty:
* Find the longest substring \(s\) starting from the beginning of the remaining word that is in the vocabulary \(\mathcal{V}\).
* If no such substring is found (e.g., the remainder begins with a character not in the vocabulary), the entire word is typically replaced with a special `[UNK]` token.
* Otherwise, add \(s\) to the token list.
* Remove the prefix \(s\) from the word.
- The resulting list is the tokenization.
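A sketch of this greedy longest-match procedure, using the BERT-style convention of marking non-initial pieces with a `##` prefix (an assumption here, since the description above does not fix a prefix) and falling back to `[UNK]` for the whole word.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first segmentation; continuation pieces carry a '##' prefix."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:                       # shrink the candidate from the right
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                        # no known piece matches the remainder
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"build", "##ing", "tall", "##est", "un", "##believ", "##able"}
print(wordpiece_tokenize("building", vocab))      # ['build', '##ing']
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", vocab))           # ['[UNK]']
```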
Key Difference from BPE: While both merge iteratively, BPE uses frequency, and WordPiece uses a likelihood-based score, which can be more robust to imbalanced token distributions.
7. Unigram Language Model Tokenization
The Unigram model [3] takes a radically different approach. Instead of building the vocabulary bottom-up by merging, it starts with a large, overcomplete vocabulary and prunes it down top-down, using a probabilistic objective to evaluate the quality of the vocabulary and segmentations.
7.1 Probabilistic Model
The fundamental assumption is that the probability of a word \(w\) is the sum of the probabilities of all its possible segmentations \(\mathcal{S}(w)\).
\[P(w) = \sum_{\mathbf{t} \in \mathcal{S}(w)} P(\mathbf{t})\]
where the probability of a segmentation \(\mathbf{t} = (t_1, ..., t_k)\) is defined as the product of the probabilities of its constituent subword tokens:
\[P(\mathbf{t}) = \prod_{i=1}^k p(t_i), \quad \text{subject to} \quad \sum_{t \in \mathcal{V}} p(t) = 1\]
Here, \(p(t)\) is the probability of subword token \(t\), which is a parameter of the model.
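The marginalization over segmentations can be written as a short recursion; the subword probabilities below are made up and unnormalized, purely to illustrate the sum.

```python
def word_probability(word: str, p: dict[str, float]) -> float:
    """P(w): sum over all segmentations of w into vocabulary pieces of the product of piece probabilities."""
    if word == "":
        return 1.0
    total = 0.0
    for i in range(1, len(word) + 1):       # choose the first piece, recurse on the rest
        piece = word[:i]
        if piece in p:
            total += p[piece] * word_probability(word[i:], p)
    return total

p = {"h": 0.05, "u": 0.05, "g": 0.05, "hu": 0.1, "ug": 0.1, "hug": 0.2}
print(word_probability("hug", p))
# 0.2 + 0.05*0.1 + 0.1*0.05 + 0.05*0.05*0.05 = 0.210125
```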
7.2 Algorithm Formulation: Expectation-Maximization (EM) and Pruning
Input: A training corpus \(\mathbb{D}\), a desired vocabulary size \(V\).
Initialization:
- Create a massive seed vocabulary \(\mathcal{V}_0\). This is often done by pre-tokenizing into words and then taking all frequent substrings (e.g., using BPE or a similar method) to create a vocabulary much larger than \(V\).
- Initialize the subword probabilities \(p(t)\). A common method is to set \(p(t) \propto \text{frequency of } t \text{ in the corpus under some initial segmentation}\).
Iterative Optimization:
The algorithm then alternates between two steps:
- *E-step (Segmentation)*: Given the current model \((\mathcal{V}_i, p(\cdot))\), for each word \(w\) in the corpus, find the most probable segmentation:
\[\mathbf{t}^*(w) = \arg \max_{\mathbf{t} \in \mathcal{S}(w)} P(\mathbf{t}) = \arg \max_{\mathbf{t} \in \mathcal{S}(w)} \sum_{i=1}^{|\mathbf{t}|} \log p(t_i)\]
This can be efficiently computed using the Viterbi algorithm.
- *M-step (Probability Estimation)*: Given the segmentations \(\{\mathbf{t}^*(w)\}\) for all words, re-estimate the subword probabilities by maximizing the likelihood of the corpus. This is simply:
\[p(t) = \frac{n(t)}{\sum_{t' \in \mathcal{V}_i} n(t')}\]
where \(n(t)\) is the expected count of subword \(t\) in the corpus under the current most probable segmentations.
- *Vocabulary Pruning*: After the EM steps have converged (or periodically), we prune the vocabulary. For each token \(t\), we compute the *loss* in corpus likelihood incurred by removing \(t\), approximated by its contribution to the corpus log-likelihood:
\[\mathcal{L}_t = n(t) \cdot \log p(t)\]
We then sort the tokens by this loss and remove the \(k\%\) whose removal reduces the likelihood the least (single-character tokens are kept so that every string remains representable). The probabilities of the remaining tokens are renormalized. A minimal sketch of this pruning step follows the algorithm below.
Output: The algorithm repeats the EM steps and pruning until the vocabulary reaches the desired size \(V\).
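A minimal sketch of the pruning step, assuming we already have the current token probabilities and their expected counts; always keeping single-character tokens is standard practice but stated here as an explicit assumption.

```python
import math

def prune_vocab(probs: dict[str, float], counts: dict[str, int], drop_frac: float = 0.2) -> dict[str, float]:
    """Drop the tokens whose removal costs the least corpus log-likelihood, then renormalize."""
    # Contribution of token t to the corpus log-likelihood: n(t) * log p(t) (a negative number).
    loss = {t: counts.get(t, 0) * math.log(probs[t]) for t in probs}
    # Never prune single characters, so any string can still be segmented.
    candidates = [t for t in probs if len(t) > 1]
    candidates.sort(key=lambda t: loss[t], reverse=True)       # closest to zero = cheapest to remove
    dropped = set(candidates[: int(len(candidates) * drop_frac)])
    kept = {t: p for t, p in probs.items() if t not in dropped}
    z = sum(kept.values())
    return {t: p / z for t, p in kept.items()}

probs = {"h": 0.1, "u": 0.1, "g": 0.1, "hu": 0.3, "ug": 0.2, "hug": 0.2}
counts = {"h": 2, "u": 1, "g": 5, "hu": 40, "ug": 3, "hug": 50}
print(prune_vocab(probs, counts, drop_frac=0.4))   # 'ug' is the cheapest multi-character token to drop
```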
7.3 Tokenization (Encoding)
At inference time, given a word \(w\) and the final model \((\mathcal{V}, p(\cdot))\), we simply use the Viterbi algorithm to find the most probable segmentation:
\[\mathbf{t}^* = \arg \max_{\mathbf{t} \in \mathcal{S}(w)} \sum_{i=1}^{|\mathbf{t}|} \log p(t_i)\]
This is a significant advantage: the segmentation is directly optimized for the probabilistic model's objective, unlike the greedy approaches of BPE and WordPiece.
8. SentencePiece: A Unified Framework
SentencePiece [4] is not a single algorithm but a system that implements several tokenization models (including Unigram and BPE) with a crucial design difference: it treats the input text as a raw stream of characters, spaces included, removing the dependency on pre-tokenization.
8.1 Core Innovation: Pre-tokenization Free Modeling
In BPE and WordPiece, the initial unit is a "word" obtained by pre-tokenizing on whitespace and punctuation. SentencePiece directly models the raw text, which includes spaces. It represents the space character with a special visible symbol, `▁` (U+2581). This allows it to be completely language-agnostic, handling languages with no whitespace delimiters (e.g., Chinese) and those with complex morphological rules uniformly.
8.2 Mathematical Integration
When using the Unigram model within SentencePiece, the algorithm is exactly as described in Section 7. The input "sentence" is the raw character sequence. When using BPE, the merging process operates directly on the symbols of the raw text, including the space symbol. The frequency count for a pair like (`e`, `▁`) is therefore considered, allowing the model to learn units that cross word boundaries if beneficial.
The encoding and decoding processes are lossless because the original string, including spaces, can be perfectly reconstructed by concatenating the tokens and replacing the space symbol with an actual space.
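A sketch of the lossless round-trip: spaces are mapped to the visible `▁` marker before segmentation and restored afterwards by plain string replacement. The fixed-width chunking below is a stand-in for whichever BPE or Unigram model actually produces the tokens.

```python
SPACE_MARKER = "\u2581"  # the visible space symbol '▁'

def mark_spaces(text: str) -> str:
    """Turn whitespace into an ordinary symbol so tokens may cross word boundaries."""
    return text.replace(" ", SPACE_MARKER)

def detokenize(tokens: list[str]) -> str:
    """Lossless reconstruction: concatenate the tokens and restore the spaces."""
    return "".join(tokens).replace(SPACE_MARKER, " ")

marked = mark_spaces("what restaurants are nearby")
tokens = [marked[i:i + 6] for i in range(0, len(marked), 6)]   # stand-in for a real subword model
print(tokens)
print(detokenize(tokens) == "what restaurants are nearby")     # True: the original text is recovered
```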
Mathematical and Practical Implications:
* Optimality: The Unigram model is the only one that explicitly optimizes a global probabilistic objective (corpus likelihood) for both vocabulary induction and segmentation. BPE and WordPiece, being greedy, provide no such guarantee.
* Flexibility: The Unigram model can output the top-\(k\) segmentations, which is useful for uncertainty estimation. BPE and WordPiece are deterministic.
* Efficiency: BPE is often the simplest and fastest to train. The Unigram model, with its iterative EM and Viterbi decoding, is more computationally intensive.
* Robustness: SentencePiece's pre-tokenization-free approach makes it more robust across diverse languages and noisy text.
9. Summary
Tokenization has matured from a simple pre-processing step into a core component of the statistical modeling pipeline in NLP. The journey from rule-based splitting to probabilistic subword models reflects a broader trend of replacing heuristics with data-driven, learned representations.
We have provided a mathematical foundation for the key algorithms:
* BPE operates on the principle of greedy frequency maximization.
* WordPiece introduces a likelihood-ratio score to guide its greedy merges.
* The Unigram Language Model frames the problem probabilistically, using the EM algorithm and likelihood-based pruning to find a vocabulary that directly maximizes the probability of the training data.
* SentencePiece provides a unifying framework that decouples these algorithms from language-specific pre-processing.
The choice of tokenization algorithm has a non-trivial impact on downstream model performance, affecting everything from training stability and convergence speed to the model's ability to generalize to unseen words and domains. A deep understanding of their mathematical underpinnings is therefore essential for both practitioners and researchers pushing the boundaries of language technology.












