
# 20.2 Mathematical Foundations of Transformers with Code Examples

## Mathematical Foundations of Artificial Intelligence

In 2017, Vaswani et al. proposed the Transformer, a sequence transduction model based entirely on attention mechanisms, with no recurrence or convolutions. This section reconstructs the mathematics of its attention mechanism following the original paper and supplements the necessary background knowledge.

Many core machine learning models fundamentally rely on Linear Algebra principles for representation and computation. In practice, data rarely appears as simple single values; it typically manifests as datasets, i.e., collections of large numbers of data points. Linear algebra provides tools for effectively organizing, processing, and analyzing such data, enabling practitioners to represent structured data (such as tabular data) and unstructured data (such as images or video) through objects like vectors, matrices, and tensors.

Uncertainty Quantification (UQ) aims to quantify and reduce uncertainties in modeling and simulation of physical systems; when certain factors of a system are unknown, it attempts to provide confidence levels for research results.

Statistician George Box once said, "All models are wrong, but some are useful."

### Attention Definition

Scaled Dot-Product Attention, introduced by Vaswani et al. in the 2017 Transformer paper, is at its core a **weighted sum** computed from Query, Key, and Value matrices. It is defined as:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Biggl(\frac{Q K^\top}{\sqrt{d_k}}\Biggr) V
$$

Where:

$$
\begin{array}{rl}
Q \in \mathbb{R}^{n \times d_k}, & \text{Query matrix} \\
K \in \mathbb{R}^{m \times d_k}, & \text{Key matrix} \\
V \in \mathbb{R}^{m \times d_v}, & \text{Value matrix} \\
d_k, & \text{Dimension of key vectors, used for scaling} \\
\mathrm{softmax}(\cdot), & \text{Normalize each row to obtain weights} \\
\text{Similarity matrix: } S = Q K^\top, & \\
\text{Scaling: } S_{\mathrm{scaled}} = \frac{S}{\sqrt{d_k}}, & \\
\text{Weight matrix: } A = \mathrm{softmax}(S_{\mathrm{scaled}}), & \\
\text{Output: } \mathrm{Attention}(Q,K,V) = A V &
\end{array}
$$
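
The role of the scaling factor can be checked numerically. The original paper motivates it by noting that if the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance d_k. The sketch below is an illustration under exactly that assumption (standard normal components, a fixed seed, and the hypothetical helper `sample_dot_variance`); it is not part of the chapter's program.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def sample_dot_variance(d_k, trials=2000):
    """Empirical variance of q.k for random q, k with N(0, 1) components (assumed distribution)."""
    dots = []
    for _ in range(trials):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(dot(q, k))
    return statistics.pvariance(dots)

d_k = 64
raw_var = sample_dot_variance(d_k)
# Dividing every dot product by sqrt(d_k) divides the variance by d_k
scaled_var = raw_var / d_k

print(f"variance of q.k:        {raw_var:.1f} (theory: {d_k})")
print(f"variance after scaling: {scaled_var:.2f} (theory: 1)")
```

Without scaling, scores grow with d_k and softmax tends to concentrate almost all weight on one key, which leaves very small gradients; dividing by the square root of d_k keeps the scores at roughly unit scale.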

### Definition of Vectors

Mathematically, a vector is an **ordered array** of numbers that represents a quantity with both magnitude and direction. For example, a two-dimensional vector can be written as:

$$
\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}, \quad v_1, v_2 \in \mathbb{R}
$$

More generally, a vector with d components (dimension d) is written as:

$$
\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_d \end{bmatrix} \in \mathbb{R}^d
$$

Vectors have the following basic operations:

* **Addition**: element-wise addition

$$
\mathbf{u} + \mathbf{v} = \begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \vdots \\ u_d + v_d \end{bmatrix}
$$

* **Scalar multiplication**: each component of the vector multiplied by a scalar

$$
c \mathbf{v} = \begin{bmatrix} c v_1 \\ c v_2 \\ \vdots \\ c v_d \end{bmatrix}
$$

* **Dot product**: measures the similarity between two vectors

$$
\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{d} u_i v_i
$$
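
The three operations above can be sketched in a few lines of plain Python. The helper names `add`, `scale`, and `dot` are chosen here for illustration only; vectors are represented as ordinary lists.

```python
def add(u, v):
    """Element-wise addition of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

def scale(c, v):
    """Multiply every component of v by the scalar c."""
    return [c * a for a in v]

def dot(u, v):
    """Dot product: sum of component-wise products."""
    return sum(a * b for a, b in zip(u, v))

u = [1.0, 2.0, 3.0]
v = [4.0, 5.0, 6.0]

print(add(u, v))      # [5.0, 7.0, 9.0]
print(scale(2.0, v))  # [8.0, 10.0, 12.0]
print(dot(u, v))      # 1*4 + 2*5 + 3*6 = 32.0

# Orthogonal vectors have dot product 0, i.e. no similarity in this sense
print(dot([1.0, 0.0], [0.0, 1.0]))  # 0.0
```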

### Weighted Sum

Given vectors v₁,…,vₘ and corresponding weights α₁,…,αₘ, where the weights are non-negative and sum to 1, the output vector o is their weighted sum.

$$
\mathbf{o} = \sum_{i=1}^{m} \alpha_i v_i, \quad \alpha_i \ge 0, \quad \sum_{i=1}^{m} \alpha_i = 1
$$

This is the basic operation for **selectively aggregating information from a set of vectors**. Attention performs exactly this kind of weighted sum, with the weights computed from the similarity between a query vector and each key.
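
As a small worked example of the weighted sum, the following sketch combines three two-dimensional vectors with hand-picked weights. The function name `weighted_sum` and the sample numbers are assumptions made for illustration.

```python
def weighted_sum(weights, vectors):
    """Return sum_i weights[i] * vectors[i], checking the weight constraints first."""
    assert all(w >= 0 for w in weights), "weights must be non-negative"
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    dim = len(vectors[0])
    return [sum(w * vec[j] for w, vec in zip(weights, vectors)) for j in range(dim)]

values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # v_1, v_2, v_3
alphas = [0.5, 0.25, 0.25]                     # alpha_1, alpha_2, alpha_3

o = weighted_sum(alphas, values)
print(o)  # [0.75, 0.5]
```

Because the weights are non-negative and sum to 1, each component of the output lies between the smallest and largest corresponding components of the inputs.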

### Query, Key, Value Vectors

$$
\begin{array}{rl}
\text{Query vector: } & q \in \mathbb{R}^{d_k} \\
\text{Key vector set: } & k_1, k_2, \dots, k_m \in \mathbb{R}^{d_k} \\
\text{Value vector set: } & v_1, v_2, \dots, v_m \in \mathbb{R}^{d_v} \\
\text{Similarity: } & s_i = q \cdot k_i, \quad i = 1, \dots, m \\
\text{Weights: } & \alpha_i = \frac{\exp(s_i)}{\sum_{j=1}^{m} \exp(s_j)}, \quad i = 1, \dots, m \\
\text{Output vector: } & \text{output} = \sum_{i=1}^{m} \alpha_i v_i
\end{array}
$$

### Similarity and Softmax

The dot product measures the similarity between the query and each key:

$$
s_i = q \cdot k_i
$$

Softmax then converts these similarities into weights that are non-negative and sum to 1:

$$
\alpha_i = \frac{\exp(s_i)}{\sum_{j=1}^{m} \exp(s_j)}
$$

Thus, the values corresponding to the most relevant keys receive larger weights.
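
The two steps can be traced for a single query with a short sketch. The query and keys below are made-up two-dimensional vectors, and subtracting the maximum score before exponentiating is a common numerical-stability trick added here; it does not change the result of the formula above.

```python
import math

def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Softmax over a list of scores, shifted by the maximum for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # aligned, orthogonal, opposite to q

scores = [dot(q, k) for k in keys]  # [1.0, 0.0, -1.0]
alphas = softmax(scores)

print(scores)
print([round(a, 3) for a in alphas])  # [0.665, 0.245, 0.09]
```

The key that points in the same direction as the query receives the largest weight, and the key pointing the opposite way receives the smallest.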

### Vector-Matrix Form

Stacking the m value vectors as the rows of V, the m key vectors as the rows of K, and n query vectors as the rows of Q gives the matrix form of attention:

$$
\begin{array}{rl}
V \in \mathbb{R}^{m \times d_v}, \quad K \in \mathbb{R}^{m \times d_k}, \quad Q \in \mathbb{R}^{n \times d_k} & \\
S = Q K^\top & \\
S_{\mathrm{scaled}} = \frac{S}{\sqrt{d_k}} & \\
A = \mathrm{softmax}(S_{\mathrm{scaled}}) & \\
\mathrm{Attention}(Q,K,V) = A V &
\end{array}
$$
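
The matrix form can be followed step by step in plain Python, with matrices stored as lists of rows. This is a minimal sketch for checking shapes and row sums, not an efficient implementation; the helper names (`matmul`, `transpose`, `softmax_rows`, `attention`) and the sample matrices are assumptions made for illustration.

```python
import math

def matmul(A, B):
    """Matrix product of A (p x q) and B (q x r), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax_rows(S):
    """Apply softmax to each row of S independently."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(Q, K, V):
    d_k = len(K[0])
    S = matmul(Q, transpose(K))                                  # (n, m) similarity matrix
    S_scaled = [[s / math.sqrt(d_k) for s in row] for row in S]  # scaling
    A = softmax_rows(S_scaled)                                   # (n, m) weight matrix
    return matmul(A, V), A                                       # (n, d_v) output

Q = [[1.0, 0.0], [0.0, 1.0]]                              # n = 2, d_k = 2
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                  # m = 3, d_k = 2
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # m = 3, d_v = 3

out, A = attention(Q, K, V)
print(len(out), len(out[0]))            # 2 3, i.e. shape (n, d_v)
print([round(sum(row), 6) for row in A])  # each row of A sums to 1
```

Choosing V as the identity matrix makes the output equal to the weight matrix A, which is a convenient way to inspect the weights directly.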

## Program Example

The following example trains a miniature GPT-style model on the four-character sequence "你好世界" (Hello World). Using causal self-attention over learned embeddings, it learns to predict the next character at every position of the sequence.

```python
# Import PyTorch core library
import torch                               # Deep learning framework: provides tensor operations and automatic differentiation
import torch.nn as nn                      # Neural network modules: Linear, Embedding, LayerNorm, ModuleList, etc.
import torch.nn.functional as F            # Functional interface: provides stateless operations such as softmax
import torch.optim as optim                # Optimizer module: provides parameter update algorithms such as Adam
import math                                # Math library: used for sqrt to compute scaling factor

# ============================================================
# 0. vocab — Vocabulary definition
# ============================================================
# The example uses only 4 Chinese characters as a minimal vocabulary for understanding each step of attention computation
chars = ["你", "好", "世", "界"]            # Vocabulary: 4 characters
vocab_size = len(chars)                     # Vocabulary size = 4

# Forward mapping: character → integer index, for model input
char2idx = {c: i for i, c in enumerate(chars)}
# Reverse mapping: integer index → character, for restoring Chinese characters during inference
idx2char = {i: c for c, i in char2idx.items()}

# ============================================================
# 1. Transformer hyperparameters (standard Transformer style)
# ============================================================
d_model = 512                               # Model hidden dimension (embedding dimension), classic Transformer configuration
n_heads = 8                                 # Number of multi-head attention heads
n_layers = 6                                # Number of stacked Transformer blocks
d_ff = 2048                                 # Feed-forward network intermediate dimension (typically 4x d_model)

# Ensure d_model is divisible by n_heads, so each head dimension is an integer
assert d_model % n_heads == 0
d_head = d_model // n_heads                 # Dimension per attention head = 512 / 8 = 64

# ============================================================
# 2. Training data — Autoregressive sequence construction
# ============================================================
# The model learns the whole pipeline (token → embedding → attention → output logits) end to end;
# no target vectors are specified by hand, only the target token indices below.
#
# Input sequence:  [你, 好, 世, 界]   →  indices [0, 1, 2, 3]
# Target sequence: [好, 世, 界, 你]   →  indices [1, 2, 3, 0]
# This is "autoregressive": given the first t characters, predict the (t+1)-th character
# (at the last position the target wraps around to the first character)

data = torch.tensor([[0, 1, 2, 3]])         # Input tensor, shape (batch=1, seq_len=4)
target = torch.tensor([[1, 2, 3, 0]])       # Target tensor, shape (batch=1, seq_len=4)

# ============================================================
# 3. Multi-Head Attention — Multi-head attention layer
# ============================================================
class MultiHeadAttention(nn.Module):
    """
    Standard Scaled Dot-Product Multi-Head Self-Attention.

    Mathematical definition:
        Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

    Process: input x → linear projection to obtain Q/K/V → split into multiple heads → compute attention scores →
          scale → causal mask → softmax → weighted sum of V → merge heads → output projection
    """

    def __init__(self, d_model, n_heads):
        """
        Parameters:
            d_model: Total model dimension, also the projection dimension for Q/K/V
            n_heads: Number of attention heads
        """
        super().__init__()

        self.d_model = d_model              # Model dimension, e.g., 512
        self.n_heads = n_heads              # Number of attention heads, e.g., 8
        self.d_head = d_model // n_heads    # Dimension per head = 512 / 8 = 64

        # Four bias-free linear projection matrices (all with shape d_model × d_model):
        self.W_Q = nn.Linear(d_model, d_model, bias=False)   # Query projection: x → Q
        self.W_K = nn.Linear(d_model, d_model, bias=False)   # Key projection:   x → K
        self.W_V = nn.Linear(d_model, d_model, bias=False)   # Value projection: x → V
        self.W_O = nn.Linear(d_model, d_model, bias=False)   # Output projection: fusion after concatenation

    def forward(self, x, return_attention=False):
        """
        Forward pass.

        Parameters:
            x:               Input tensor, shape (B, T, D)
            return_attention: Whether to return intermediate computation results (Q/K/V/scores/alpha) for interpretability analysis

        Returns:
            if return_attention=False: out, shape (B, T, D)
            if return_attention=True:  (out, Q, K, V, scores, scores_scaled, alpha)
        """
        B, T, D = x.shape                   # B=batch size, T=sequence length, D=d_model

        # ---------- Step 1: Linear projection x → Q, K, V ----------
        Q = self.W_Q(x)                     # (B, T, D) → (B, T, D), query vector for each token
        K = self.W_K(x)                     # (B, T, D) → (B, T, D), key vector for each token
        V = self.W_V(x)                     # (B, T, D) → (B, T, D), value vector for each token

        # ---------- Step 2: Split into multiple heads ----------
        # view:   (B, T, D) → (B, T, n_heads, d_head)
        # transpose: swap dimensions 1 and 2 → (B, n_heads, T, d_head)
        # After this, each head independently has a T × d_head Q/K/V subspace
        Q = Q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        K = K.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        V = V.view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        # ---------- Step 3: Compute attention scores S = Q · Kᵀ ----------
        # Q: (B, n_heads, T, d_head), Kᵀ: (B, n_heads, d_head, T)
        # scores: (B, n_heads, T, T) — raw similarity between position i and position j
        scores = Q @ K.transpose(-2, -1)

        # ---------- Step 4: Scale S / √d_k ----------
        # Divide by √d_k to prevent dot product values from becoming too large, avoiding softmax gradient saturation
        # With standard configuration d_head = 64, the scaling factor = 8
        scores_scaled = scores / math.sqrt(self.d_head)

        # ---------- Step 5: Causal Mask ----------
        # Autoregressive language models require position i to only attend to tokens at positions ≤ i (no future information)
        # torch.triu(..., diagonal=1) generates an upper triangular matrix (True above the diagonal)
        # For example, when T=4:
        #   [[F, T, T, T],
        #    [F, F, T, T],
        #    [F, F, F, T],
        #    [F, F, F, F]]
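        # Build the mask on the same device as x so the code also works when the model runs on a GPU
        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        # Fill masked positions with -inf; exp(-inf) = 0, so their softmax weights are exactly 0 ("attention forbidden")
        scores_scaled = scores_scaled.masked_fill(mask, float("-inf"))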

        # ---------- Step 6: Softmax normalization ----------
        # Apply softmax along the last dimension (key direction) to obtain attention weight distribution α
        # Sum of weights in each row = 1
        alpha = F.softmax(scores_scaled, dim=-1)

        # ---------- Step 7: Weighted sum output = α · V ----------
        # α:  (B, n_heads, T, T)     — attention weights
        # V:  (B, n_heads, T, d_head) — value vectors
        # out: (B, n_heads, T, d_head) — weighted context representation
        out = alpha @ V

        # ---------- Step 8: Merge multiple heads ----------
        # transpose: (B, n_heads, T, d_head) → (B, T, n_heads, d_head)
        # contiguous + view: flatten to (B, T, n_heads * d_head) = (B, T, D)
        out = out.transpose(1, 2).contiguous()
        out = out.view(B, T, D)

        # ---------- Step 9: Output projection ----------
        out = self.W_O(out)                # W_O fuses information from different heads

        # Determine return content based on return_attention flag
        if return_attention:
            return out, Q, K, V, scores, scores_scaled, alpha

        return out


# ============================================================
# 4. FeedForward — Feed-forward network
# ============================================================
class FeedForward(nn.Module):
    """
    Position-wise Feed-Forward Network.
    Independently applies two fully connected layers + ReLU activation to each position's representation:
        FFN(x) = ReLU(x·W₁ + b₁)·W₂ + b₂
    The intermediate dimension d_ff is typically 4x d_model (512 → 2048), expanding then compressing back to original dimension.
    """

    def __init__(self, d_model, d_ff):
        """
        Parameters:
            d_model: Input/output dimension
            d_ff:    Intermediate hidden layer dimension (typically 4x d_model)
        """
        super().__init__()

        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),      # Up-projection: d_model → d_ff (512 → 2048)
            nn.ReLU(),                      # Nonlinear activation function, introduces nonlinearity
            nn.Linear(d_ff, d_model)        # Down-projection: d_ff → d_model (2048 → 512)
        )

    def forward(self, x):
        return self.net(x)


# ============================================================
# 5. TransformerBlock — Transformer block
# ============================================================
class TransformerBlock(nn.Module):
    """
    A complete Transformer block using Pre-Norm residual structure:

        x  →  LayerNorm  →  MultiHeadAttention  →  +  →  x'
        x' →  LayerNorm  →  FeedForward         →  +  →  x"

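    Pre-Norm (normalize before each sublayer) tends to train more stably than Post-Norm in deep stacks.
    Note: the original 2017 paper applies LayerNorm after the residual addition (Post-Norm).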
    """

    def __init__(self, d_model, n_heads, d_ff):
        """
        Parameters:
            d_model: Model hidden dimension
            n_heads: Number of attention heads
            d_ff:    Feed-forward network intermediate dimension
        """
        super().__init__()

        # First Pre-Norm sublayer: LayerNorm + MultiHeadAttention
        self.ln1 = nn.LayerNorm(d_model)     # Layer normalization, standardizes along the last dimension
        self.attn = MultiHeadAttention(d_model, n_heads)

        # Second Pre-Norm sublayer: LayerNorm + FeedForward
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff)

    def forward(self, x, return_attention=False):
        """
        Forward pass.

        Parameters:
            x:                Input tensor (B, T, D)
            return_attention: Whether to pass to attention layer to capture intermediate values

        Returns:
            Normal mode: output tensor (B, T, D)
            Attention mode: (output, Q, K, V, scores, scores_scaled, alpha)
        """
        if return_attention:
            # Need to capture attention intermediate values:
            #   First LayerNorm, then attention (also returning attention intermediate values), finally residual connection
            attn_out, Q, K, V, scores, scores_scaled, alpha = \
                self.attn(self.ln1(x), return_attention=True)

            x = x + attn_out                # Residual connection 1: x + Attention(LayerNorm(x))
            x = x + self.ffn(self.ln2(x))  # Residual connection 2: x + FFN(LayerNorm(x))

            return x, Q, K, V, scores, scores_scaled, alpha

        else:
            # Normal forward: do not capture intermediate values
            x = x + self.attn(self.ln1(x))  # Residual connection 1
            x = x + self.ffn(self.ln2(x))   # Residual connection 2

            return x


# ============================================================
# 6. GPT — Autoregressive language model
# ============================================================
class GPT(nn.Module):
    """
    Miniature GPT (Generative Pre-trained Transformer) model.

    Architecture composition:
        Token Embedding → added with Position Embedding
        → N TransformerBlocks
        → Final LayerNorm
        → Linear projection head (outputs vocabulary-dimension logits)

    Through return_attention=True, intermediate attention values of the first TransformerBlock can be captured
    for visualization analysis and teaching demonstrations.
    """

    def __init__(
        self,
        vocab_size,                         # Vocabulary size, here 4
        d_model=512,                        # Hidden dimension
        n_heads=8,                          # Number of attention heads
        n_layers=6,                         # Number of Transformer blocks
        d_ff=2048,                          # FFN intermediate dimension
        max_len=128                         # Maximum supported sequence length
    ):
        super().__init__()

        # Token Embedding: maps character indices to d_model-dimensional dense vectors
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        # Position Embedding: assigns a learnable embedding vector to each absolute position 0~max_len-1
        # Gives the model access to token order (self-attention by itself is order-agnostic);
        # the original paper uses fixed sinusoidal encodings instead of learned ones
        self.position_embedding = nn.Embedding(max_len, d_model)

        # Use ModuleList instead of Sequential, to traverse by index and extract specific layer outputs
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff)
            for _ in range(n_layers)        # Stack n_layers=6 Transformer blocks
        ])

        # Final layer normalization: stabilizes distribution after all Transformer blocks, before output projection
        self.ln_f = nn.LayerNorm(d_model)
        # Output projection head: maps d_model-dimensional hidden states back to vocab_size dimensions, obtaining logits for each token
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx, return_attention=False):
        """
        Forward pass.

        Parameters:
            idx:              Input token indices, shape (B, T)
            return_attention: Whether to return intermediate attention values of the first TransformerBlock

        Returns:
            if return_attention=False: logits, shape (B, T, vocab_size)
            if return_attention=True:  (logits, (Q, K, V, scores, scores_scaled, alpha))
        """
        B, T = idx.shape                    # B=batch size, T=sequence length

        # Generate position indices [0, 1, 2, ..., T-1], shape (1, T), created directly on the same device as idx
        positions = torch.arange(T, device=idx.device).unsqueeze(0)

        # Token embedding + Position embedding (element-wise addition)
        token_emb = self.token_embedding(idx)       # (B, T) → (B, T, d_model)
        pos_emb = self.position_embedding(positions)  # (1, T) → (1, T, d_model) → broadcast
        x = token_emb + pos_emb                       # Embedding fusion: (B, T, d_model)

        saved = None                         # For saving intermediate results of the first layer's attention

        for i, block in enumerate(self.blocks):
            # Iterate through each TransformerBlock
            if return_attention and i == 0:
                # Only capture attention intermediate values at the first layer (i==0) for analysis
                x, Q, K, V, scores, scores_scaled, alpha = \
                    block(x, return_attention=True)
                saved = (Q, K, V, scores, scores_scaled, alpha)
            else:
                x = block(x)                # Other layers proceed with normal forward pass

        x = self.ln_f(x)                    # Final LayerNorm

        logits = self.head(x)               # (B, T, d_model) → (B, T, vocab_size)

        if return_attention:
            return logits, saved            # Return predictions + first layer attention data

        return logits


# ============================================================
# 7. Model instantiation and optimizer/loss function configuration
# ============================================================
model = GPT(                                # Instantiate GPT model
    vocab_size=vocab_size,                  # Vocabulary size = 4
    d_model=d_model,                        # Dimension = 512
    n_heads=n_heads,                        # 8 attention heads
    n_layers=n_layers,                      # 6 Transformer layers
    d_ff=d_ff                               # FFN intermediate dimension = 2048
)

optimizer = optim.Adam(model.parameters(), lr=1e-4)  # Adam optimizer, learning rate 0.0001
loss_fn = nn.CrossEntropyLoss()                      # Cross-entropy loss: measures the gap between logits and target distribution

# ============================================================
# 8. Training loop
# ============================================================
for epoch in range(1000):                  # Train for 1000 epochs

    optimizer.zero_grad()                   # Clear gradient cache from the previous round

    logits = model(data)                    # Forward pass: input [0,1,2,3] → logits (1,4,4)

    # Compute cross-entropy loss
    # logits.view(-1, vocab_size): (1*4, 4) = (4, 4)  — flatten to 4 samples, each with 4 classes
    # target.view(-1):            (1*4,)  = (4,)      — flatten to 4 target labels
    loss = loss_fn(
        logits.view(-1, vocab_size),
        target.view(-1)
    )

    loss.backward()                         # Backward pass: compute gradients for all parameters
    optimizer.step()                        # Parameter update: Adam step using the computed gradients

    if epoch % 100 == 0:                   # Output current loss every 100 epochs
        print(f"\nEpoch {epoch}, loss={loss.item():.6f}")

# ============================================================
# 9. Inference and attention visualization
# ============================================================
with torch.no_grad():                       # Disable gradient computation during inference to save memory

    # Set return_attention=True during inference to capture intermediate attention values of the first layer
    logits, saved = model(data, return_attention=True)

    # argmax takes the index with the highest logits at each position, i.e., the model's predicted next character
    pred = torch.argmax(logits, dim=-1)    # (1, 4, 4) → (1, 4)

    # Unpack saved attention intermediate values
    Q, K, V, scores, scores_scaled, alpha = saved

    # ---------- Print input/output ----------
    print("\n==============================")
    print("Input")
    print("==============================")
    print([idx2char[i.item()] for i in data[0]])   # Restore indices to Chinese characters

    print("\n==============================")
    print("Prediction")
    print("==============================")
    print([idx2char[i.item()] for i in pred[0]])

    # ---------- Visualize first layer attention ----------
    # The following takes data from batch=0, head=0 to show the complete attention computation process

    print("\n==============================")
    print("Q")
    print("==============================")
    print(Q[0, 0].detach().numpy().round(3))       # Batch 0, Head 0: (T, d_head)

    print("\n==============================")
    print("K")
    print("==============================")
    print(K[0, 0].detach().numpy().round(3))

    print("\n==============================")
    print("V")
    print("==============================")
    print(V[0, 0].detach().numpy().round(3))

    print("\n==============================")
    print("scores = QK^T")                          # S = Q · Kᵀ, raw similarity
    print("==============================")
    print(scores[0, 0].detach().numpy().round(3))   # (T, T) matrix, each row is the score of a query against all keys

    print("\n==============================")
    print("scores_scaled")                          # S / √d_k, after scaling
    print("==============================")
    print(scores_scaled[0, 0].detach().numpy().round(3))

    print("\n==============================")
    print("softmax alpha")                          # α = softmax(S_scaled)
    print("==============================")
    print(alpha[0, 0].detach().numpy().round(3))    # Each row sums ≈ 1, causal mask makes α[i][j>i]=0
```

Example program output (no random seed is set, so the exact numbers differ between runs):

```text
Epoch 0, loss=2.033882

Epoch 100, loss=0.000004

Epoch 200, loss=0.000001

Epoch 300, loss=0.000001

Epoch 400, loss=0.000001

Epoch 500, loss=0.000001

Epoch 600, loss=0.000001

Epoch 700, loss=0.000001

Epoch 800, loss=0.000001

Epoch 900, loss=0.000001

==============================
Input
==============================
['你', '好', '世', '界']

==============================
Prediction
==============================
['好', '世', '界', '你']

==============================
Q
==============================
[[-0.35  -0.407 -0.483  0.202 -1.21   0.052 -0.176 -0.561 -0.954  0.093
  -0.656  0.312 -0.605  0.628 -0.073 -0.659 -0.658 -0.503  0.324  0.06
  -0.949  0.006 -0.826 -0.279 -0.249 -0.938  0.166 -1.03  -0.397 -0.074
  -0.224 -0.779  0.28  -0.599 -0.993 -1.21  -1.172  0.522  0.781  0.627
   0.02   0.964  0.134  0.466 -0.597 -0.354  1.265  0.147  0.153 -0.168
  -0.758 -1.047 -0.158 -0.468  0.3   -0.336 -0.627  0.59  -0.605  0.138
   0.853 -0.566  0.453  1.011]
 [ 0.27  -0.235 -1.544 -0.038 -0.234  0.129 -0.311 -0.97  -0.243  0.366
   0.974 -0.945  1.132 -0.288  0.058  0.435 -1.075  0.144 -0.33   0.04
   0.292  0.917  0.534  0.309  0.005  1.361 -0.053  1.036  0.357  0.03
  -1.367  1.261  0.283 -1.16   0.064  0.076  0.333 -0.472  0.25   0.676
   0.645 -0.466 -0.35   0.288 -0.622 -0.433  0.905 -0.78  -0.267  0.335
  -0.605  1.117  1.436 -0.976  0.721 -0.072 -0.166  0.609  0.454 -0.149
  -0.154 -0.668  0.161 -0.913]
 [ 0.581  0.029 -0.121  0.056 -0.817  0.408  0.55   0.592  0.344  0.392
  -0.727  0.332 -0.747  0.111  0.192 -0.743  0.444  0.83   0.271  0.468
  -0.137 -0.249  0.421 -0.437 -0.49   0.057  1.117  0.208 -0.024  0.32
  -0.607 -1.125 -0.129  0.433 -0.024  0.241 -0.815 -0.317 -0.91  -0.111
   1.066  0.623 -0.157 -0.078  0.192  0.671 -0.932  0.325 -0.445  0.267
  -0.302 -0.772 -0.207  0.422 -0.264  0.245 -0.51   0.083 -0.671  0.297
  -0.732 -0.329  0.484  0.396]
 [-0.164 -0.309  0.665  0.228  0.476  0.356  0.508 -0.426  0.738 -1.743
   0.765 -1.036  0.976  0.177  0.305 -0.756 -0.159 -0.67   0.28  -0.417
   0.269 -1.157  0.675  0.042 -0.181  0.131  0.475  0.959 -0.266 -0.379
  -0.019  0.514 -0.391  0.601 -0.263  0.054  0.207  0.211 -0.374  1.04
  -0.138  0.116 -0.39   0.163 -0.706 -0.45  -0.003 -0.221 -0.198 -0.205
   0.011 -0.605  0.984 -0.19   0.639 -0.172  0.391  0.23  -0.836  0.388
  -0.843  0.875  1.419  1.416]]

==============================
K
==============================
[[-0.758  0.836  1.233 -0.518  1.507 -0.171  0.372  1.24  -0.012  0.445
  -0.646  0.797 -1.395 -0.258 -0.864  1.3    0.911  0.249  0.716  0.063
  -0.287  0.377 -0.545 -0.265  0.266 -1.027 -0.283 -0.82  -1.036 -0.487
   0.696  0.463 -0.023 -0.2    0.003 -0.34   0.028 -0.417 -0.254  0.457
  -0.762  0.02   0.358 -0.576  0.728 -0.    -0.277  0.591  0.231 -0.004
   0.678  0.182 -0.355  0.832 -0.75  -0.295  0.176 -0.704  0.651 -0.327
  -0.286 -0.435 -0.667 -0.421]
 [-0.367  0.32  -0.186 -0.378 -0.099  0.194  0.173 -0.819 -0.245  0.213
   0.657  0.48  -0.556 -0.199  0.147  0.707 -0.712 -0.05  -0.426 -0.776
  -0.975 -0.009  1.464 -0.459  0.894 -0.531  0.331  0.945 -0.772 -0.308
  -1.053  0.235 -0.061 -0.689  0.808  1.045 -1.145 -0.563  0.35   0.508
   0.531 -0.083 -1.198 -0.17   0.603 -0.352  0.115 -0.765 -0.331 -0.01
  -0.155 -0.298  0.175 -0.224  0.238 -0.13  -0.169  0.69  -1.039 -0.211
  -0.962 -0.051  0.119  0.095]
 [-1.209 -0.033  0.01   0.999 -0.245 -0.039  0.07  -0.573 -0.168 -0.506
   0.262 -0.775  0.129  0.323 -0.973 -0.38  -0.042 -0.542 -0.859  0.274
   0.533 -0.437 -0.169 -0.279 -0.091 -0.663  0.257 -0.414 -0.418  0.152
   0.11   0.647 -0.328 -0.307  0.3   -0.767 -0.319  0.01  -0.304  0.964
   0.459 -0.167  0.375 -0.713 -0.56  -0.774  0.093  0.444  0.294 -0.894
  -0.624 -0.389  0.906 -0.127  0.104 -0.849  0.852 -0.373  0.257 -0.362
  -0.849 -0.226 -0.101  0.524]
 [-0.113  0.098 -1.399 -0.02  -0.138 -0.146 -0.012  0.238 -0.635  0.471
  -0.496  0.014  0.286 -0.013 -0.101  0.229 -1.211 -0.011  0.656  0.288
   0.015  1.026 -0.975 -0.086  0.849 -0.453 -0.008 -0.421  0.008  0.809
   0.159  0.371 -0.331 -0.861  0.488 -0.312  0.251 -0.362 -0.15   0.226
  -0.11  -1.087  0.258 -0.936  1.026 -0.263 -0.249 -0.646  0.053  0.126
   0.028  0.064 -0.741  0.314  0.544 -0.04  -0.123  0.305 -0.596 -0.908
  -0.176 -0.341 -0.923  0.163]]

==============================
V
==============================
[[-0.67  -0.117 -1.085 -0.536  1.016  0.047  0.154 -0.087  0.926 -0.14
   0.781 -0.443  0.172 -0.433 -0.567  0.666 -0.321  0.173 -0.428 -0.357
  -0.559 -0.221  0.138 -0.469  0.326 -0.691  0.647  0.108  0.062 -0.926
  -0.424 -0.87   0.053 -1.048  0.652 -0.798 -0.27   0.211 -0.281 -0.317
   0.637 -0.954 -0.274 -0.991  0.6    0.093  0.044 -1.527 -0.575 -0.087
   0.494 -0.896  0.691 -1.437 -0.675 -0.253 -0.193 -0.758 -0.29  -1.044
  -0.631 -0.194  0.17   0.962]
 [-0.193  0.785  0.281  1.047 -0.778 -0.042  0.493 -0.989 -0.119  0.309
  -0.724 -0.085 -0.068 -0.205  0.213  1.236  0.288  1.454 -0.181 -0.452
   0.334  0.37  -0.163  0.507  0.01   0.317  0.442  0.108  0.258 -0.605
   0.597  0.36  -0.311  0.552 -0.37   0.423  1.258 -0.014  0.607  0.109
   0.078 -0.015 -0.743  0.727  0.526  0.137  0.076  0.581 -0.797 -0.633
  -1.153  0.331 -0.329  0.178 -0.325 -1.605  0.141  0.207  1.123 -0.171
  -0.499  0.082 -0.06  -0.245]
 [-0.05   0.524 -0.457 -0.141 -0.514 -0.749  0.284 -0.071  0.072 -0.255
  -0.066 -0.178  0.629  0.028  0.027 -0.088 -0.594  0.894 -0.834  0.163
  -0.85  -0.203  0.152 -0.006 -0.053  0.338 -0.117  0.177 -0.059 -0.246
   0.522 -0.267  0.842  0.308 -0.153  0.036 -0.091  0.693  0.592  0.425
   0.679  0.323  0.123 -0.902  0.734  0.215 -0.396 -0.322  0.013  0.511
   0.352 -0.016 -0.481  0.078  0.021 -0.197  0.624 -0.753 -0.083  0.389
   0.226  0.331  0.727 -0.86 ]
 [-0.219  0.152 -0.382  0.197 -0.753 -0.36  -0.544 -0.78   0.7   -0.306
  -0.269  0.145 -0.222  0.287 -0.851 -0.078 -0.227  0.145 -0.272  0.687
   0.432  0.179 -1.144  0.775 -1.057 -0.776  0.556  0.788  0.059  1.002
  -1.004 -0.037 -0.441 -0.011  0.488 -0.348 -0.531  0.883  0.063 -0.765
  -0.091  0.13  -0.484 -1.145  0.331  0.787  0.875 -0.226  0.819 -0.523
  -1.217 -0.296 -0.068 -0.378  0.773  0.452 -0.011 -0.157  0.278 -0.372
  -0.036  0.179 -0.251  0.309]]

==============================
scores = QK^T
==============================
[[ -4.192   0.67    3.931   1.269]
 [-13.375   6.972   1.766   3.113]
 [ -2.067   2.363  -2.743  -2.388]
 [ -9.26    3.297   6.765  -7.769]]

==============================
scores_scaled
==============================
[[-0.524   -inf   -inf   -inf]
 [-1.672  0.871   -inf   -inf]
 [-0.258  0.295 -0.343   -inf]
 [-1.158  0.412  0.846 -0.971]]

==============================
softmax alpha
==============================
[[1.    0.    0.    0.   ]
 [0.073 0.927 0.    0.   ]
 [0.273 0.475 0.251 0.   ]
 [0.069 0.333 0.514 0.084]]
```

The above code implements a miniature **GPT (autoregressive language model)**: given "你" it predicts "好", given "你好" it predicts "世", and so on. After 1000 epochs of training, the cross-entropy loss dropped from 2.033882 to 0.000001, and the model precisely learned the shift relationship of the sequence `['你', '好', '世', '界']` → `['好', '世', '界', '你']`.

The model strictly conforms to the standard mathematical definition of **Scaled Dot-Product Attention**:

$$
\mathrm{Attention}(Q,K,V) =
\mathrm{softmax}\left( \frac{Q K^\top}{\sqrt{d\_k}} \right) V
$$

The correspondence in the code is as follows.

**Q, K, V Generation**

$$
Q = W\_Q x,\quad K = W\_K x,\quad V = W\_V x
$$

In the source code, `Q = self.W_Q(x)`, `K = self.W_K(x)`, and `V = self.W_V(x)`: all three are linear projections of the same input `x`, which is what makes this self-attention. With

$$
d\_{\mathrm{model}} = 512, \quad n\_{\mathrm{heads}} = 8
$$

each head has dimension

$$
d\_k = d\_{\mathrm{head}} = \frac{d\_{\mathrm{model}}}{n\_{\mathrm{heads}}} = \frac{512}{8} = 64
$$

so the projection results are split into 8 heads of 64 dimensions each (code: `Q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)`).

The output only shows Q, K, V, scores, and alpha for the first attention head (`head=0`), allowing readers to trace the complete computation chain.
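The head split can be sketched for a single token vector as follows. This is a minimal illustration in plain Python rather than PyTorch, so it runs without dependencies; `split_heads` is a hypothetical helper that is not part of the source code, and it assumes the contiguous layout that `view` uses on the last dimension.

```python
d_model, n_heads = 512, 8
d_head = d_model // n_heads  # 64


def split_heads(vec, n_heads):
    """Cut one d_model-sized vector into n_heads contiguous chunks."""
    size = len(vec) // n_heads
    return [vec[h * size:(h + 1) * size] for h in range(n_heads)]


# Use the feature index itself as the value so the mapping is visible.
vec = list(range(d_model))
heads = split_heads(vec, n_heads)

# Feature j lands in head j // d_head at offset j % d_head.
print(len(heads), len(heads[0]))
print(heads[1][0], heads[7][-1])
```

In the source code, the subsequent `transpose(1, 2)` only reorders the axes to `(B, n_heads, T, d_head)` so that each head can be processed as its own batch of `T × d_head` matrices.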

**Similarity and Scaling**

$$
S = Q K^\top, \qquad S\_\mathrm{scaled} = \frac{S}{\sqrt{d\_k}}
$$

Corresponding to the source code `scores = Q @ K.transpose(-2, -1)` and `scores_scaled = scores / math.sqrt(self.d_head)`.
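As a quick check, the scaling step can be reproduced from the printed `scores = QK^T` matrix for head 0. The sketch below uses plain Python instead of PyTorch and copies the rounded values from the output, so small differences in the last digit are expected.

```python
import math

d_head = 64  # d_k per head: 512 / 8

# Raw scores QK^T for head 0, copied from the printed output (3 decimals).
scores = [
    [-4.192, 0.670, 3.931, 1.269],
    [-13.375, 6.972, 1.766, 3.113],
    [-2.067, 2.363, -2.743, -2.388],
    [-9.260, 3.297, 6.765, -7.769],
]

scale = math.sqrt(d_head)  # 8.0
scores_scaled = [[s / scale for s in row] for row in scores]

for row in scores_scaled:
    print([round(s, 3) for s in row])
```

The lower triangle (including the diagonal) agrees with the printed `scores_scaled` matrix; the upper triangle differs only because the printed matrix already has the causal mask applied.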

**Causal Mask**

```python
mask = torch.triu(torch.ones(T, T), diagonal=1).bool()
scores_scaled = scores_scaled.masked_fill(mask, float("-inf"))
```

The upper triangle is set to `-∞`, forcing position `i` to only attend to positions `0,…,i`: this is the fundamental constraint of autoregressive models — when predicting the next token, future information must not be accessed in advance. The `-inf` in the scaled matrix is a direct manifestation of this constraint.
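The effect of the mask can be sketched without PyTorch. In the illustration below, `masked_fill` is a hypothetical plain-Python stand-in for the tensor method of the same name, and the all-zero matrix is a placeholder for the real scaled scores.

```python
tokens = ["你", "好", "世", "界"]
T = len(tokens)

# mask[i][j] is True strictly above the diagonal (j > i), mirroring
# torch.triu(torch.ones(T, T), diagonal=1).bool()
mask = [[j > i for j in range(T)] for i in range(T)]


def masked_fill(matrix, mask, value):
    """Return a copy of matrix with `value` wherever mask is True."""
    return [
        [value if m else x for x, m in zip(row, mask_row)]
        for row, mask_row in zip(matrix, mask)
    ]


placeholder = [[0.0] * T for _ in range(T)]
masked = masked_fill(placeholder, mask, float("-inf"))

# Tokens that each position is still allowed to attend to.
visible = [[tokens[j] for j in range(T) if not mask[i][j]] for i in range(T)]
for i, row in enumerate(visible):
    print(tokens[i], "->", row)
```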

**Softmax Normalization**

$$
A = \mathrm{softmax}(S\_\mathrm{scaled})
$$

Corresponding to `alpha = F.softmax(scores_scaled, dim=-1)`. Since

$$
e^{-\infty} = 0
$$

the weights at masked positions are zero; the remaining positions are non-negative and sum to 1, forming a valid probability distribution:

$$
\begin{array}{rl}
\text{Position 0 (你)} &: \[1.000, 0, 0, 0] \\
\text{Position 1 (好)} &: \[0.073, 0.927, 0, 0] \\
\text{Position 2 (世)} &: \[0.273, 0.475, 0.251, 0] \\
\text{Position 3 (界)} &: \[0.069, 0.333, 0.514, 0.084]
\end{array}
$$
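These weights can be recomputed from the printed `scores_scaled` matrix. The sketch below is a plain-Python softmax (not the `F.softmax` call itself) applied row by row; because the inputs are the rounded values from the output, agreement is only expected to about three decimals.

```python
import math

NEG_INF = float("-inf")


def softmax(row):
    """Numerically stable softmax over one row; exp(-inf) evaluates to 0.0."""
    m = max(row)  # finite here, since the diagonal is never masked
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Masked, scaled scores for head 0, copied from the printed output.
scores_scaled = [
    [-0.524, NEG_INF, NEG_INF, NEG_INF],
    [-1.672, 0.871, NEG_INF, NEG_INF],
    [-0.258, 0.295, -0.343, NEG_INF],
    [-1.158, 0.412, 0.846, -0.971],
]

alpha = [softmax(row) for row in scores_scaled]
for row in alpha:
    print([round(a, 3) for a in row])
```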

**Weighted Sum Output**

$$
\mathrm{output} = A V
$$

Corresponding to `out = alpha @ V`, followed by merging heads and projecting output through `W_O`.
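To make the weighted sum concrete, the sketch below multiplies the printed `alpha` by the first three of the 64 columns of the printed `V` for head 0. Restricting to three columns is purely for brevity and is an assumption of this illustration; the source code multiplies by the full matrix.

```python
# Attention weights for head 0, copied from the printed output.
alpha = [
    [1.000, 0.000, 0.000, 0.000],
    [0.073, 0.927, 0.000, 0.000],
    [0.273, 0.475, 0.251, 0.000],
    [0.069, 0.333, 0.514, 0.084],
]

# First three columns of V for head 0 (rows correspond to 你, 好, 世, 界).
V_part = [
    [-0.670, -0.117, -1.085],
    [-0.193, 0.785, 0.281],
    [-0.050, 0.524, -0.457],
    [-0.219, 0.152, -0.382],
]

# out[i] = sum_j alpha[i][j] * V_part[j]: a convex combination of value rows.
out = [
    [sum(a * v for a, v in zip(alpha_row, column)) for column in zip(*V_part)]
    for alpha_row in alpha
]

for row in out:
    print([round(x, 3) for x in row])
```

Position 0 can only attend to itself, so its output row equals the first row of `V`; each later row mixes the value vectors of the positions it is allowed to see.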

**Feed-Forward Network (FFN)**

```python
nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512)
```

$$
d\_\mathrm{ff} = 4 \times d\_\mathrm{model} = 2048
$$

Consistent with the original paper.
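For a sense of scale, the parameter count of this feed-forward block can be worked out by hand. The sketch below assumes the default `bias=True` of `nn.Linear`; `linear_params` is a hypothetical helper used only for the arithmetic.

```python
d_model = 512
d_ff = 4 * d_model  # 2048


def linear_params(n_in, n_out, bias=True):
    """Number of weights (plus optional biases) in one linear layer."""
    return n_in * n_out + (n_out if bias else 0)


ffn_params = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
print(ffn_params)  # 2099712
```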

**Residual Connections and Layer Normalization**

```python
x = x + self.attn(self.ln1(x))   # Pre-LayerNorm
x = x + self.ffn(self.ln2(x))
```

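This is the Pre-LN variant, in which layer normalization is applied to the input of each sublayer before the residual addition:

$$
x + \mathrm{Sublayer}(\mathrm{LayerNorm}(x))
$$

The original paper instead normalizes after the residual addition (Post-LN).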
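The difference between the two orderings can be sketched in plain Python. In the illustration below, `layer_norm` omits the learnable scale and shift (that is, it assumes gain 1 and bias 0), and `sublayer` is a hypothetical stand-in for attention or the FFN.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer(x):
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [2.0 * v + 1.0 for v in x]


def pre_ln(x):
    # x + Sublayer(LayerNorm(x)): the residual path carries x unnormalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_ln(x):
    # LayerNorm(x + Sublayer(x)): the block output is always normalized.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


x = [1.0, 2.0, 4.0, 7.0]
print(pre_ln(x))
print(post_ln(x))
```

With Pre-LN the residual stream keeps the scale of `x`, whereas Post-LN re-normalizes the output of every block to zero mean.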

**Scale Comparison**

This example uses only 4 tokens, with

$$
d\_\mathrm{model} = 512, \quad n\_\mathrm{layers} = 6, \quad n\_\mathrm{heads} = 8
$$

These hyperparameters match the base configuration of the original Transformer paper. The run shows that, with the mathematical structure implemented correctly, the model fits even this extremely short sequence almost exactly (final loss 0.000001).

## Exercises

1. What limitation of traditional recurrent neural networks in processing long sequences does the self-attention mechanism in the Transformer architecture address? Please explain the core idea of the attention mechanism in your own words.

