
Tokenization: BPE, WordPiece, and Text Processing

23. 02. 2024 · 4 min read · intermediate

Tokenization is a key step in preparing text for AI models: it divides text into smaller units. Get to know the BPE and WordPiece algorithms and understand how text processing works in modern language models.

Tokenization in Modern NLP: From Words to Subwords

Tokenization is a fundamental step in every NLP system. While traditional approaches split text into words by spaces, modern models like GPT or BERT use more advanced techniques like Byte-Pair Encoding (BPE) and WordPiece. These algorithms can elegantly solve the out-of-vocabulary (OOV) word problem and efficiently represent extensive vocabularies.

Why Classical Tokenization Is Not Enough

Imagine you’re training a model on English texts and encounter the word “unhappiness”. A classical word-based tokenizer would either add this word to the vocabulary (if it appears frequently enough), or mark it as an unknown token. Both approaches have fundamental disadvantages:

  • Large vocabulary takes more memory and slows down training
  • Unknown tokens cause information loss
  • The model cannot learn morphology and word formation

Modern subword tokenization solves these problems by dividing words into smaller meaningful units.
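To make the failure mode concrete, here is a minimal sketch of a word-level tokenizer with a fixed vocabulary; the function name and the tiny vocabulary are made up for illustration:

```python
# Hypothetical illustration: a word-level tokenizer with a fixed vocabulary
# maps every unseen word to a single [UNK] token, discarding its internal
# structure entirely.
def word_tokenize(text, vocab):
    """Split on whitespace; replace out-of-vocabulary words with [UNK]."""
    return [w if w in vocab else "[UNK]" for w in text.split()]

vocab = {"the", "model", "is", "happy"}
print(word_tokenize("the model is unhappiness", vocab))
# ['the', 'model', 'is', '[UNK]']
```

The model never sees that “unhappiness” shares a root with “happy”, which is exactly the information subword tokenization preserves.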

Byte-Pair Encoding (BPE)

BPE originally emerged as a compression algorithm but found application in NLP. The algorithm works as follows:

# Simple BPE implementation
def get_pairs(vocab):
    """Gets all pairs of adjacent symbols"""
    pairs = {}
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            pairs[pair] = pairs.get(pair, 0) + freq
    return pairs

import re

def merge_vocab(pair, vocab):
    """Merges the given symbol pair everywhere, respecting symbol boundaries"""
    # Match the bigram only between whole symbols, so that e.g. the pair
    # ('a', 'p') cannot match inside the multi-character symbol 'ha'
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    replacement = ''.join(pair)

    new_vocab = {}
    for word in vocab:
        new_word = pattern.sub(replacement, word)
        new_vocab[new_word] = vocab[word]
    return new_vocab
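
The two helpers combine into the full training loop. The sketch below is self-contained (the helpers are repeated in compact, boundary-safe form) and runs on a made-up toy corpus; `train_bpe` and the corpus are illustrative, not a library API:

```python
import re

def get_pairs(vocab):
    """Count all adjacent symbol pairs, weighted by word frequency."""
    pairs = {}
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] = pairs.get((a, b), 0) + freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge one symbol pair everywhere, respecting symbol boundaries."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

def train_bpe(vocab, num_merges):
    """Learn merge rules by repeatedly merging the most frequent pair."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        merges.append(best)
    return vocab, merges

# Toy corpus: words pre-split into characters, with corpus frequencies
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
final_vocab, merges = train_bpe(corpus, 4)
print(merges)   # learned merge rules, most frequent first
```

The order of the learned merges matters: it becomes the priority list used later when encoding new text.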

BPE starts with a vocabulary of individual characters and iteratively merges the most frequent pairs. For the word “unhappiness”, the process might look like this:

# Original state
"u n h a p p i n e s s"

# After several iterations
"un hap p i ness"

# Final tokenization
["un", "hap", "pi", "ness"]
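
At encoding time, the learned merges are replayed on new words. This is a simplified sketch (real tokenizers apply merges by learned priority until none applies); the merge list here is hypothetical, chosen so it reproduces the segmentation above:

```python
def bpe_encode(word, merges):
    """Segment a word by applying learned merge rules in learned order."""
    symbols = list(word)
    for a, b in merges:
        i, merged = 0, []
        while i < len(symbols):
            # Merge the pair wherever the two symbols sit side by side
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list that yields the segmentation from the text
merges = [('u', 'n'), ('h', 'a'), ('ha', 'p'), ('p', 'i'),
          ('n', 'e'), ('ne', 's'), ('nes', 's')]
print(bpe_encode("unhappiness", merges))
# ['un', 'hap', 'pi', 'ness']
```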

The WordPiece Algorithm

WordPiece, used in BERT, is similar to BPE but with an important difference. Instead of selecting the most frequent pair, it selects the pair that maximizes the likelihood of the training data. It also uses the special prefix “##” to mark sub-tokens that do not begin a word:

# WordPiece tokenization
"unhappiness" → ["un", "##hap", "##pi", "##ness"]
"playing" → ["play", "##ing"]
"hello" → ["hello"]  # frequent word remains whole
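
The difference in the selection criterion can be made concrete. WordPiece is commonly described as scoring a pair by freq(ab) / (freq(a) · freq(b)), so it prefers pairs whose parts rarely occur apart; the counts below are invented for illustration:

```python
def wordpiece_score(pair_count, first_count, second_count):
    """WordPiece merge score: high when the parts rarely occur apart."""
    return pair_count / (first_count * second_count)

# Invented symbol and pair counts for illustration
counts = {'e': 500, 's': 400, 'q': 12, 'u': 300}
pair_counts = {('e', 's'): 90, ('q', 'u'): 12}

# BPE picks the raw-frequency winner
bpe_choice = max(pair_counts, key=pair_counts.get)
# WordPiece picks the likelihood winner
wp_choice = max(pair_counts,
                key=lambda p: wordpiece_score(pair_counts[p],
                                              counts[p[0]], counts[p[1]]))
print(bpe_choice)  # ('e', 's') — simply the most frequent pair
print(wp_choice)   # ('q', 'u') — 'q' is almost always followed by 'u'
```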

Practical Implementation with Hugging Face

In practice, we usually use ready-made implementations. Hugging Face Transformers provides a simple API:

from transformers import AutoTokenizer

# GPT-2 uses BPE
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "Tokenization is important for NLP"
tokens = gpt2_tokenizer.tokenize(text)
print(tokens)
# e.g. ['Token', 'ization', 'Ġis', 'Ġimportant', 'Ġfor', ...] ('Ġ' marks a leading space)

# BERT uses WordPiece
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = bert_tokenizer.tokenize("unhappiness")
print(tokens)
# ['un', '##hap', '##pi', '##ness']

Training a Custom BPE Tokenizer

For specialized domains, we often need our own tokenizer. SentencePiece is a popular library for this purpose:

import sentencepiece as spm

# Training BPE model
spm.SentencePieceTrainer.train(
    input='training_data.txt',
    model_prefix='my_bpe',
    vocab_size=32000,
    model_type='bpe',
    character_coverage=0.9995,
    split_by_unicode_script=True,
    split_by_number=True
)

# Loading and usage
sp = spm.SentencePieceProcessor(model_file='my_bpe.model')
tokens = sp.encode('Custom tokenizer for English data')
print(tokens)
print(sp.decode(tokens))

Optimization for Specific Languages

Different languages have specifics that affect tokenization. Rich morphology, diacritics, and relatively free word order require attention:

# Example problems with morphologically rich languages
text = "Programming, I program, I programmed"

# Poorly configured tokenizer might create:
# ["Program", "ming", ",", " I", " program", ",", " I", " program", "med"]

# Better tokenization would recognize the root:
# ["program", "##ming", ",", " I", " program", ",", " I", " program", "##med"]

For better results with morphologically rich languages, we recommend:

  • Higher character_coverage (0.9999) due to diacritics
  • Pre-trained models for specific languages when available
  • Preprocessing for diacritic normalization if the task allows
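
The diacritic normalization mentioned above can be sketched with Python’s standard unicodedata module; note that stripping combining marks is lossy and only appropriate when the task tolerates it (e.g. matching or deduplication):

```python
import unicodedata

def strip_diacritics(text):
    """Decompose to NFD, then drop the combining marks (lossy!)."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

# Czech example phrase chosen for its heavy use of diacritics
print(strip_diacritics("Příliš žluťoučký kůň"))
# Prilis zlutoucky kun
```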

Performance and Memory Requirements

Vocabulary size choice is a trade-off between performance and quality. Larger vocabulary means:

  • Shorter token sequences → faster inference
  • Larger embedding matrix → higher memory requirements
  • More parameters → slower training

# Tokenization analysis
def analyze_tokenization(tokenizer, texts):
    total_tokens = 0
    total_chars = 0

    for text in texts:
        tokens = tokenizer.tokenize(text)
        total_tokens += len(tokens)
        total_chars += len(text)

    compression_ratio = total_chars / total_tokens
    print(f"Compression ratio: {compression_ratio:.2f} chars/token")
    return compression_ratio

# Comparing different tokenizers on a few sample sentences
sample_texts = [
    "Tokenization is important for NLP",
    "Subword units handle rare words gracefully",
]
gpt2_ratio = analyze_tokenization(gpt2_tokenizer, sample_texts)
bert_ratio = analyze_tokenization(bert_tokenizer, sample_texts)

Summary

Tokenization is a critical first step in every NLP pipeline. The BPE and WordPiece algorithms elegantly solve the OOV word problem and enable efficient text representation. When choosing a tokenizer, consider the specifics of the target language, the size of your data, and your performance requirements. For morphologically rich languages, we recommend using pre-trained models or carefully tuning the parameters when training your own tokenizer.

Tags: tokenization, BPE, NLP

CORE SYSTEMS Team

We build core systems and AI agents that keep operations running. 15 years of experience in enterprise IT.