
Regularization — Dropout, L1/L2, and the Fight Against Overfitting

29. 10. 2025 4 min read intermediate

Overfitting is one of the most common problems when training neural networks. When a model learns the training data too perfectly, it loses the ability to generalize to new data. Regularization techniques such as Dropout and L1/L2 regularization are proven tools for combating it.

What Is Overfitting and Why It Is Harmful

Overfitting occurs when a model learns the training data too well: it memorizes every detail, including random noise, but then fails on new data. Imagine a student who memorizes a textbook by heart but doesn't understand the principles and fails the exam.

Regularization is a set of techniques that prevents this problem. It adds "friction" to training that keeps the model from over-specializing on the training data. Let's look at the three most important techniques.

Dropout – Random “Turning Off” of Neurons

Dropout is an elegant technique that randomly deactivates a portion of the neurons during training. It's like randomly closing your eyes or plugging your ears while learning: you force the brain to rely on varying combinations of inputs.

How Dropout Works

During the forward pass, we randomly set selected neurons to zero with probability p (typically 0.2–0.5). The remaining neurons are scaled by a factor of 1/(1-p) to preserve the overall signal strength.

import torch
import torch.nn as nn

class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, dropout_rate=0.3):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(dropout_rate)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)  # Dropout after activation
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        x = self.fc3(x)
        return x

# Usage
model = SimpleNet(784, 256, 10, dropout_rate=0.4)
model.train()  # Important: dropout only works in train mode

It’s crucial to remember that we only use dropout during training. During inference, we call model.eval(), which deactivates dropout.
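To see the two modes side by side, here is a quick sanity check (a minimal standalone sketch, not from the article): in train mode some entries are zeroed and the survivors are scaled, while in eval mode the layer passes the input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()                  # training mode: neurons are zeroed at random
y_train = drop(x)             # survivors are scaled by 1/(1-p) = 2.0

drop.eval()                   # inference mode: dropout is a no-op
y_eval = drop(x)

print(y_train)  # typically a mix of 0.0 and 2.0
print(y_eval)   # unchanged: identical to the input
```

This is exactly why forgetting model.eval() before inference silently degrades predictions: the network keeps dropping activations at test time.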

L1 and L2 Regularization – Penalizing Large Weights

L1 and L2 regularization add penalties to the loss function for model weight magnitude. The principle is simple: large weights often lead to overfitting, so we “penalize” them.

L2 Regularization (Weight Decay)

L2 regularization adds the term λ∑w² to the loss function, where λ is the regularization coefficient. It penalizes large weights quadratically but doesn’t drive them completely to zero.

import torch.optim as optim

# L2 regularization through weight_decay in optimizer
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

# Or manually in loss function
def l2_regularization(model, lambda_reg=1e-4):
    l2_reg = 0
    for param in model.parameters():
        l2_reg += torch.norm(param, p=2) ** 2
    return lambda_reg * l2_reg

# Usage in training
criterion = nn.CrossEntropyLoss()
output = model(inputs)
loss = criterion(output, targets) + l2_regularization(model)

L1 Regularization – Creating Sparse Models

L1 regularization uses λ∑|w| and has an interesting property – it can zero out less important weights, creating sparse models.

def l1_regularization(model, lambda_reg=1e-4):
    l1_reg = 0
    for param in model.parameters():
        l1_reg += torch.norm(param, p=1)
    return lambda_reg * l1_reg

# Combined L1 + L2 regularization (Elastic Net)
def elastic_net_regularization(model, l1_lambda=1e-4, l2_lambda=1e-4):
    l1_reg = sum(torch.norm(p, p=1) for p in model.parameters())
    l2_reg = sum(torch.norm(p, p=2) ** 2 for p in model.parameters())
    return l1_lambda * l1_reg + l2_lambda * l2_reg
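To make the wiring explicit, the penalty simply gets added to the task loss before backward(), just like in the L2 example earlier. A minimal sketch (the toy model and random data here are made up for illustration):

```python
import torch
import torch.nn as nn

def elastic_net_regularization(model, l1_lambda=1e-4, l2_lambda=1e-4):
    l1_reg = sum(torch.norm(p, p=1) for p in model.parameters())
    l2_reg = sum(torch.norm(p, p=2) ** 2 for p in model.parameters())
    return l1_lambda * l1_reg + l2_lambda * l2_reg

torch.manual_seed(0)
model = nn.Linear(20, 5)                 # toy model for illustration
inputs = torch.randn(16, 20)
targets = torch.randint(0, 5, (16,))

criterion = nn.CrossEntropyLoss()
base_loss = criterion(model(inputs), targets)
penalty = elastic_net_regularization(model)
loss = base_loss + penalty               # penalized loss drives the gradients
loss.backward()

print(f"penalty: {penalty.item():.6f}")
```

Because the penalty is part of the computation graph, backward() automatically pushes the weights toward smaller (L2) and sparser (L1) values.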

Practical Tips for Using Regularization

Hyperparameter Tuning

Regularization strength must be tuned carefully: too weak and it won't help, too strong and it will "suffocate" the model and prevent it from learning.

  • Dropout rate: Start with 0.2-0.3 for hidden layers, 0.1-0.2 for input layer
  • Weight decay: Typically 1e-4 to 1e-6, depends on dataset size
  • L1 regularization: Usually weaker than L2, start with 1e-5
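One way to watch weight decay at work is a small toy experiment (made up for illustration; a tiny linear layer, fixed seed, plain SGD): train the same model twice on identical data and compare the final weight norms.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train_weight_norm(weight_decay):
    torch.manual_seed(0)                 # identical init and data on every call
    model = nn.Linear(10, 1)
    opt = optim.SGD(model.parameters(), lr=0.1, weight_decay=weight_decay)
    x = torch.randn(64, 10)
    y = torch.randn(64, 1)
    loss_fn = nn.MSELoss()
    for _ in range(200):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model.weight.norm().item()

print(train_weight_norm(0.0))   # larger final weight norm
print(train_weight_norm(0.5))   # noticeably smaller norm
```

The run with weight decay ends up with a smaller weight norm on the same data, which is precisely the "penalize large weights" effect described above.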

Combining Techniques

class RegularizedNet(nn.Module):
    def __init__(self, input_size, hidden_sizes, output_size, dropout_rates):
        super().__init__()
        self.layers = nn.ModuleList()
        self.dropouts = nn.ModuleList()

        # Hidden layers
        prev_size = input_size
        for hidden_size, dropout_rate in zip(hidden_sizes, dropout_rates):
            self.layers.append(nn.Linear(prev_size, hidden_size))
            self.dropouts.append(nn.Dropout(dropout_rate))
            prev_size = hidden_size

        # Output layer (no dropout)
        self.output_layer = nn.Linear(prev_size, output_size)
        self.relu = nn.ReLU()

    def forward(self, x):
        for layer, dropout in zip(self.layers, self.dropouts):
            x = self.relu(layer(x))
            x = dropout(x)
        return self.output_layer(x)

# Model with gradually decreasing dropout
model = RegularizedNet(
    input_size=784,
    hidden_sizes=[512, 256, 128],
    output_size=10,
    dropout_rates=[0.2, 0.3, 0.4]  # Higher dropout in deeper layers
)

# Optimizer with weight decay
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

Monitoring Effectiveness

Monitor the difference between training and validation loss. Regularization works when this difference decreases without significantly worsening validation loss.

# Monitoring during training (train_loader / val_loader assumed to exist)
train_losses = []
val_losses = []

for epoch in range(num_epochs):
    # Training
    model.train()
    train_loss = 0.0
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    train_loss /= len(train_loader)

    # Validation
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for inputs, targets in val_loader:
            val_loss += criterion(model(inputs), targets).item()
    val_loss /= len(val_loader)

    train_losses.append(train_loss)
    val_losses.append(val_loss)

    print(f"Epoch {epoch}: Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")

Summary

Regularization is essential for training robust models. Dropout randomly deactivates neurons, forcing the model to learn redundant representations. L1/L2 regularization penalizes large weights and encourages simpler models. The key to success is proper hyperparameter tuning and combining techniques. Remember: a slightly underfitted model that generalizes is better than a perfectly overfitted model that fails in practice.

Tags: dropout, regularization, overfitting

CORE SYSTEMS Team

We build core systems and AI agents that keep operations running. 15 years of experience with enterprise IT.