Tokenization is a key step in preparing text for AI models: it divides text into smaller units. This article introduces the BPE and WordPiece algorithms and explains how text processing works in modern language models.
Tokenization in Modern NLP: From Words to Subwords
Tokenization is a fundamental step in every NLP system. While traditional approaches split text into words at whitespace, modern models like GPT or BERT use more advanced techniques such as Byte-Pair Encoding (BPE) and WordPiece. These algorithms elegantly solve the out-of-vocabulary (OOV) problem and represent large vocabularies efficiently.
Why Classical Tokenization Is Not Enough
Imagine you’re training a model on English texts and encounter the word “unhappiness”. A classical word-level tokenizer would either add this word to the vocabulary (if it appears frequently enough) or mark it as an unknown token. Both approaches cause problems:
- Large vocabulary takes more memory and slows down training
- Unknown tokens cause information loss
- The model cannot learn morphology and word formation
Modern subword tokenization solves these problems by dividing words into smaller meaningful units.
Byte-Pair Encoding (BPE)
BPE originally emerged as a compression algorithm but found application in NLP. The algorithm works as follows:
# Simple BPE implementation
import re

def get_pairs(vocab):
    """Count all adjacent symbol pairs, weighted by word frequency."""
    pairs = {}
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            pairs[pair] = pairs.get(pair, 0) + freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge the given symbol pair everywhere it occurs as whole symbols."""
    new_vocab = {}
    # Match the bigram only at symbol boundaries, so that merging
    # ('a', 'b') does not touch the 'a b' inside 'na b'
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    replacement = ''.join(pair)
    for word in vocab:
        new_word = pattern.sub(replacement, word)
        new_vocab[new_word] = vocab[word]
    return new_vocab
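The two helpers can be combined into a training loop that repeatedly merges the most frequent pair. Here is a minimal, self-contained sketch on a toy corpus (the words and frequencies are invented for illustration; the helpers are repeated so the snippet runs standalone):

```python
import re
from collections import Counter

def get_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge the pair wherever it occurs as two whole adjacent symbols."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    merged = ''.join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with invented frequencies
vocab = {'u n h a p p i n e s s': 3, 'h a p p y': 5, 'u n d o': 2}
merges = []
for _ in range(6):  # in practice, merge until the vocabulary budget is reached
    pairs = get_pairs(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    merges.append(best)
print(merges)
```

The learned merge list is the entire model: applying the merges in order to new text reproduces the training-time segmentation.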
BPE starts with a vocabulary of individual characters and iteratively merges the most frequent pairs. For the word “unhappiness”, the process might look like this:
# Original state
"u n h a p p i n e s s"
# After several iterations
"un hap p i ness"
# Final tokenization
["un", "hap", "pi", "ness"]
The WordPiece Algorithm
WordPiece, used in BERT, is similar to BPE but with one important difference: instead of merging the most frequent pair, it merges the pair that most increases the likelihood of the training data. It also marks sub-tokens that do not start a word with the special prefix “##”:
# WordPiece tokenization
"unhappiness" → ["un", "##hap", "##pi", "##ness"]
"playing" → ["play", "##ing"]
"hello" → ["hello"] # frequent word remains whole
Practical Implementation with Hugging Face
In practice, we usually use ready-made implementations. Hugging Face Transformers provides a simple API:
from transformers import AutoTokenizer
# GPT-2 uses BPE
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Tokenization is important for NLP"
tokens = gpt2_tokenizer.tokenize(text)
print(tokens)
# e.g. ['Token', 'ization', 'Ġis', 'Ġimportant', 'Ġfor', 'ĠN', 'LP']
# (GPT-2 renders a preceding space as 'Ġ'; the exact split depends on the merges)
# BERT uses WordPiece
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = bert_tokenizer.tokenize("unhappiness")
print(tokens)
# ['un', '##hap', '##pi', '##ness']
Training Your Own BPE Tokenizer
For specialized domains, we often need our own tokenizer. SentencePiece is a popular library for this purpose:
import sentencepiece as spm

# Train a BPE model
spm.SentencePieceTrainer.train(
    input='training_data.txt',
    model_prefix='my_bpe',
    vocab_size=32000,
    model_type='bpe',
    character_coverage=0.9995,
    split_by_unicode_script=True,
    split_by_number=True
)

# Load and use the trained model
sp = spm.SentencePieceProcessor(model_file='my_bpe.model')
tokens = sp.encode('Custom tokenizer for English data')
print(tokens)
print(sp.decode(tokens))
Optimization for Specific Languages
Languages differ in ways that matter for tokenization: rich morphology, diacritics, and relatively free word order all demand attention:
# Example problems with morphologically rich languages
text = "Programming, I program, I programmed"
# Poorly configured tokenizer might create:
# ["Program", "ming", ",", " I", " program", ",", " I", " program", "med"]
# Better tokenization would recognize the root:
# ["program", "##ming", ",", " I", " program", ",", " I", " program", "##med"]
For better results with morphologically rich languages, we recommend:
- Higher character_coverage (0.9999) due to diacritics
- Pre-trained models for specific languages when available
- Preprocessing for diacritic normalization if the task allows
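The diacritic normalization mentioned above can be done with the standard library alone. Note that this is lossy (it merges accented and unaccented forms), so use it only when the task tolerates that:

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks after NFD decomposition.

    Lossy: 'café' and 'cafe' become identical, which helps when the
    tokenizer's vocabulary lacks accented variants but destroys
    distinctions the task may need.
    """
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics('Müller ordered a naïve café'))
# → 'Muller ordered a naive cafe'
```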
Performance and Memory Requirements
Vocabulary size choice is a trade-off between performance and quality. Larger vocabulary means:
- Shorter token sequences → faster inference
- Larger embedding matrix → higher memory requirements
- More parameters → slower training
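To make the memory side of the trade-off concrete, here is a back-of-the-envelope estimate of the embedding matrix size in fp32 (d_model = 768 is an illustrative choice, roughly BERT-base scale):

```python
# Embedding matrix: vocab_size x d_model parameters, 4 bytes each in fp32
d_model = 768  # illustrative hidden size

for vocab_size in (16_000, 32_000, 64_000):
    mib = vocab_size * d_model * 4 / 2**20
    print(f"{vocab_size:>6} tokens -> {mib:.1f} MiB embedding matrix")
```

Doubling the vocabulary doubles this matrix (and the output projection, if untied), while only modestly shortening token sequences.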
# Tokenization analysis
def analyze_tokenization(tokenizer, texts):
    total_tokens = 0
    total_chars = 0
    for text in texts:
        tokens = tokenizer.tokenize(text)
        total_tokens += len(tokens)
        total_chars += len(text)
    compression_ratio = total_chars / total_tokens
    print(f"Compression ratio: {compression_ratio:.2f} chars/token")
    return compression_ratio
# Compare different tokenizers on the same corpus
# (sample_texts: any list of example strings)
gpt2_ratio = analyze_tokenization(gpt2_tokenizer, sample_texts)
bert_ratio = analyze_tokenization(bert_tokenizer, sample_texts)
Summary
Tokenization is a critical first step in every NLP pipeline. The BPE and WordPiece algorithms elegantly solve the OOV problem and enable efficient text representation. When choosing a tokenizer, consider the characteristics of the target language, the size of your data, and your performance requirements. For morphologically rich languages, we recommend using pre-trained models or carefully tuning the parameters when training your own tokenizer.