Transformer Architecture - A Complete Guide

15.09.2024 · 4 min read · intermediate

The Transformer architecture is a breakthrough in artificial intelligence that powers modern language models such as GPT and BERT. This guide explains in simple terms how key mechanisms like attention and self-attention work, and why Transformers are so effective.

What Is the Transformer Architecture

Transformer architecture represents a revolutionary approach to sequence processing that has dominated Natural Language Processing since 2017. The key innovation is the self-attention mechanism, which allows the model to track relationships between all positions in a sequence simultaneously, unlike sequential processing in RNN or LSTM networks.

The basic principle lies in transforming input token sequences into vectors using an attention mechanism that weights the importance of individual positions for each element in the sequence. This allows the model to learn contextual word representations, where meaning depends on the entire sentence context.

Figure: Transformer architecture

Encoder-Decoder Structure

The original Transformer consists of two main parts:

  • Encoder - processes the input sequence and creates contextual representations
  • Decoder - generates the output sequence based on the encoded representations

Each part contains a stack of several identical layers (typically 6), and each layer has two main components: multi-head attention and a feed-forward network.

Self-Attention Mechanism

The heart of the Transformer is scaled dot-product attention:

import math

import torch
import torch.nn.functional as F


def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: Query tensor [..., seq_len, d_k]
    K: Key tensor   [..., seq_len, d_k]
    V: Value tensor [..., seq_len, d_k]
    """
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    if mask is not None:
        # Masked-out positions get a large negative score -> ~0 after softmax
        scores = scores.masked_fill(mask == 0, -1e9)

    attention_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)

    return output, attention_weights

The mechanism works by creating three vectors for each token - Query (what I’m looking for), Key (what I’m offering), and Value (what I’m passing on). The attention score is calculated as the dot product between the Query and Key vectors, scaled by the square root of the key dimension.
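
A quick usage sketch with random dummy tensors (assuming the imports above; shapes and sizes are arbitrary example values):

# Dummy batch: 2 sequences of 5 tokens, model dimension 8
Q = torch.randn(2, 5, 8)
K = torch.randn(2, 5, 8)
V = torch.randn(2, 5, 8)

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)   # torch.Size([2, 5, 8])
print(weights.shape)  # torch.Size([2, 5, 5]) - one weight per query/key pair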

Multi-Head Attention

Multi-head attention allows the model to track different types of relationships simultaneously:

import torch.nn as nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Split into heads
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Attention for each head
        attention, _ = scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads
        attention = attention.transpose(1, 2).contiguous().view(
            batch_size, -1, self.d_model
        )

        return self.W_o(attention)
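
A quick self-attention call on dummy data, assuming the class above (the sizes are illustrative only):

mha = MultiHeadAttention(d_model=8, num_heads=2)
x = torch.randn(2, 5, 8)   # [batch, seq_len, d_model]
out = mha(x, x, x)         # self-attention: query = key = value = x
print(out.shape)           # torch.Size([2, 5, 8])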

Positional Encoding

Since Transformers don’t have an inherent concept of order, we must explicitly encode token positions. Sinusoidal functions are used:

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1).float()

    # Frequencies decrease geometrically across the embedding dimensions
    div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                        -(math.log(10000.0) / d_model))

    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions: cosine

    return pe

This encoding allows the model to distinguish positions and learn distance-dependent relationships between tokens.
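
In practice, the encoding is simply added to the token embeddings before the first Transformer layer. A minimal sketch, assuming an embedding layer of matching dimension (the vocabulary size and token ids are arbitrary examples):

d_model = 8
embedding = nn.Embedding(1000, d_model)        # vocabulary size 1000 is an example value
token_ids = torch.tensor([[5, 42, 7, 0, 13]])  # [batch=1, seq_len=5]

x = embedding(token_ids)                                 # [1, 5, d_model]
x = x + positional_encoding(token_ids.size(1), d_model)  # broadcasts over the batch dimension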

Variants of the Transformer Architecture

BERT (Bidirectional Encoder Representations)

BERT uses only the encoder part and trains bidirectionally using masked language modeling:

  • Randomly masks 15% of the tokens in the input sequence (a simplified masking sketch follows this list)
  • Learns to predict the masked tokens from the entire surrounding context
  • Excellent for text-understanding tasks (classification, NER, QA)
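
The masking step can be sketched roughly as follows. This is a simplified version: real BERT replaces the selected tokens with [MASK] only 80% of the time and uses random or unchanged tokens otherwise, and mask_tokens / mask_token_id are illustrative names, not a library API:

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15):
    """Simplified masked language modeling: hide ~15% of tokens and predict them."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < mask_prob   # select ~15% of positions at random

    masked_input = token_ids.clone()
    masked_input[mask] = mask_token_id               # replace selected tokens with [MASK]
    labels[~mask] = -100                             # -100 = ignore_index of nn.CrossEntropyLoss

    return masked_input, labels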

GPT (Generative Pre-trained Transformer)

GPT uses only the decoder part with causal masking:

def create_causal_mask(seq_len):
    """Creates mask for autoregressive generation"""
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask.unsqueeze(0).unsqueeze(0)  # [1, 1, seq_len, seq_len]

This mask ensures that when predicting token at position i, the model sees only tokens at positions 0 through i-1.
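
Combined with the attention function from above, the causal mask can be applied directly. A short sketch with illustrative shapes; the head dimension of the mask is dropped here because the example skips the multi-head split:

seq_len, d_model = 5, 8
x = torch.randn(1, seq_len, d_model)    # [batch, seq_len, d_model]
mask = create_causal_mask(seq_len)      # [1, 1, seq_len, seq_len]

out, weights = scaled_dot_product_attention(x, x, x, mask=mask[0])
print(weights[0])  # row i has non-zero weights only for columns 0..i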

Basic Transformer Implementation

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        self.feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # Feed-forward with residual connection
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))

        return x
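
A full encoder is then just a stack of such blocks (typically 6). A minimal sketch, assuming the TransformerBlock above and hyperparameters roughly matching the original base model:

# Illustrative hyperparameters close to the original "base" configuration
num_layers, d_model, num_heads, d_ff = 6, 512, 8, 2048
encoder = nn.ModuleList(
    [TransformerBlock(d_model, num_heads, d_ff) for _ in range(num_layers)]
)

x = torch.randn(2, 10, d_model)   # already embedded and position-encoded input
for layer in encoder:
    x = layer(x)
print(x.shape)                    # torch.Size([2, 10, 512])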

Advantages and Disadvantages

Advantages:

  • Parallelization - unlike RNNs, all positions can be processed simultaneously during training
  • Long-range dependencies - the attention mechanism directly connects distant positions
  • Interpretability - attention weights provide insight into what the model focuses on
  • Transfer learning - pre-trained models can be fine-tuned for specific tasks

Disadvantages:

  • Computational complexity - O(n²) with respect to sequence length
  • Memory requirements - the attention matrix grows quadratically with sequence length (see the quick calculation after this list)
  • Data requirements - large amounts of training data are needed for good performance
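
To get a feel for the quadratic growth, here is a rough back-of-the-envelope sketch of the attention matrix size for a single head at float32 precision (the sequence lengths are arbitrary examples):

# Memory of one attention weight matrix (float32 = 4 bytes) per head and batch element
for seq_len in [512, 2048, 8192]:
    entries = seq_len ** 2
    print(f"seq_len={seq_len:5d}: {entries:>12,} entries ~ {entries * 4 / 1e6:8.1f} MB")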

Summary

The Transformer architecture represents a fundamental advance in sequence processing. Its key innovations - the self-attention mechanism and parallelizability - enabled the development of advanced models such as GPT, BERT, and their successors. For practical work it is important to understand that the different variants (encoder-only, decoder-only, encoder-decoder) suit different types of tasks. While the implementation can be complex, the principles are elegant and provide a solid foundation for understanding modern AI systems.

Tags: transformer, gpt, bert

CORE SYSTEMS Team

We build core systems and AI agents that keep operations running. 15 years of experience with enterprise IT.