# Chapter 6: Self-Attention and Multi-Head Attention in Transformers

## 6.4 The Mathematics of Attention

This section works through the mathematics of the attention mechanism step by step: how scores are computed from Queries and Keys, how the softmax function turns those scores into weights, and how the weights combine the Value vectors into an output.

Knowing these details makes it easier to see why attention behaves the way it does and how it can be adapted to different scenarios. It is useful background for anyone conducting research in this field or developing their own variation of the attention mechanism, whether in academia or in industry.

### 6.4.1 Scoring Function

Self-attention starts by scoring every pair of positions. The score is calculated as the dot product between a Query vector and a Key vector:

`s_i,j = Q_i * K_j`

Here `Q_i` is the Query vector for position i, `K_j` is the Key vector for position j, and `*` denotes the dot product. The score represents the relevance of the j-th input to the i-th output.

The plain dot product is not the only scoring function that could be used. The Transformer model uses the scaled dot product, which divides each score by the square root of the dimension of the keys, `d_k`:

`s_i,j = (Q_i * K_j) / sqrt(d_k)`

This scaling is done because for large values of `d_k`, the dot product grows large in magnitude, pushing the softmax function into regions where it has extremely small gradients. For example, if the components of `Q_i` and `K_j` are independent with mean 0 and variance 1, their dot product has variance `d_k`; dividing by `sqrt(d_k)` brings the variance back to 1.
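The effect of the scaling can be checked numerically. The sketch below is only an illustration: it assumes that Query and Key components are independent standard normal values, and the helper name `score_std` is made up for this example. It estimates the standard deviation of the scores with and without the division by `sqrt(d_k)`.

```python
import numpy as np


def score_std(d_k, scaled, n_samples=10_000, seed=0):
    """Empirical standard deviation of (optionally scaled) dot-product scores."""
    rng = np.random.default_rng(seed)
    # Assumption for this sketch: components are i.i.d. standard normal
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    scores = np.sum(q * k, axis=-1)  # one dot product per sampled pair
    if scaled:
        scores = scores / np.sqrt(d_k)
    return float(scores.std())


for d_k in (4, 64, 1024):
    unscaled = score_std(d_k, scaled=False)
    scaled = score_std(d_k, scaled=True)
    print(f"d_k={d_k:5d}  unscaled std={unscaled:7.2f}  scaled std={scaled:5.2f}")
```

Under these assumptions the unscaled standard deviation should grow roughly like `sqrt(d_k)`, while the scaled one should stay close to 1 regardless of `d_k`.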

### 6.4.2 Softmax Function

The softmax function converts the scores into weights that sum to one. It is given by:

`softmax(x)_i = exp(x_i) / sum(exp(x_j) for j in range(n))`

where `x` is the vector of scores, `exp` is the exponential function, and `n` is the dimension of `x`. In self-attention, the softmax is applied separately to each row of scores, that is, over j for a fixed i, which gives the weights `w_i,j`. The softmax function ensures that the weights are all positive and sum to one, which is necessary for them to represent probabilities.
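As a small illustration of the formula, here is one possible NumPy sketch of a row-wise softmax. Subtracting the row maximum before exponentiating is a common numerical-stability trick that is not part of the formula in the text; it leaves the result unchanged mathematically but avoids overflow for large scores.

```python
import numpy as np


def softmax(x, axis=-1):
    """Softmax along `axis`, shifted by the maximum for numerical stability."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


scores = np.array([[1.0, 2.0, 3.0],
                   [1000.0, 1000.0, 1000.0]])
weights = softmax(scores)

print(weights)               # first row is roughly [0.09, 0.24, 0.67]
print(weights.sum(axis=-1))  # each row sums to one
```

The second row shows why the shift helps: computing `np.exp(1000.0)` directly overflows, whereas the shifted version returns equal weights of one third each.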

### 6.4.3 Weighted Sum

The final step in the self-attention mechanism is computing the weighted sum of the Value vectors. This is done by multiplying each Value vector by its corresponding weight and summing the results:

`O_i = sum(w_i,j * V_j for j in range(n))`

where `O_i` is the i-th output, `w_i,j` is the weight of the j-th input for the i-th output, and `n` is the number of inputs.

The weighted sum ensures that the output is a mixture of the Value vectors, with the degree of contribution from each Value vector determined by its weight. This is the step that allows the model to focus more on relevant inputs and less on irrelevant ones.
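A tiny worked example may make the mixture concrete. The weights and Value vectors below are hypothetical numbers chosen for illustration, not taken from a trained model. The sketch computes one output both with the explicit sum from the formula and with the equivalent vector-matrix product.

```python
import numpy as np

# Hypothetical weights for one output position i over n = 3 inputs
w_i = np.array([0.7, 0.2, 0.1])

# Hypothetical Value vectors, one row per input
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

# Explicit form of the formula: O_i = sum_j w_i,j * V_j
O_i_loop = sum(w_i[j] * V[j] for j in range(len(w_i)))

# The same computation as a vector-matrix product
O_i = w_i @ V

print(O_i_loop)
print(O_i)  # approximately [0.8, 0.3]
```

The first input carries most of the weight, so the output lies closest to the first Value vector.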

**Example:**

Here is the complete process in Python:

```python
import numpy as np


def self_attention(Q, K, V):
    # Calculate scores (scaled dot product of every Query with every Key)
    d_k = K.shape[-1]
    scores = np.matmul(Q, K.T) / np.sqrt(d_k)

    # Apply softmax over the Key axis to get weights
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)

    # Compute output as the weighted sum of the Value vectors
    output = np.matmul(weights, V)
    return output


# Test the function with random data
Q = np.random.rand(2, 3)  # Query
K = np.random.rand(2, 3)  # Key
V = np.random.rand(2, 3)  # Value

output = self_attention(Q, K, V)
print("Output:", output)
```
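To connect the matrix code back to the per-index formulas of Sections 6.4.1 to 6.4.3, the following sketch writes the same computation with explicit loops and checks that both versions agree. The function names `self_attention_loops` and `self_attention_matrix` and the array shapes are chosen for this illustration only.

```python
import numpy as np


def self_attention_loops(Q, K, V):
    """Attention written index by index, following the formulas in the text."""
    n, d_k = K.shape
    output = np.zeros((Q.shape[0], V.shape[1]))
    for i in range(Q.shape[0]):
        # s_i,j = (Q_i * K_j) / sqrt(d_k)
        s = np.array([np.dot(Q[i], K[j]) / np.sqrt(d_k) for j in range(n)])
        # w_i,j = softmax over j of the scores for this i
        w = np.exp(s) / np.sum(np.exp(s))
        # O_i = sum over j of w_i,j * V_j
        for j in range(n):
            output[i] += w[j] * V[j]
    return output


def self_attention_matrix(Q, K, V):
    """The same computation in matrix form."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    return weights @ V


rng = np.random.default_rng(0)
Q = rng.random((2, 3))  # 2 queries of dimension 3
K = rng.random((4, 3))  # 4 keys of dimension 3
V = rng.random((4, 5))  # 4 values of dimension 5

same = np.allclose(self_attention_loops(Q, K, V), self_attention_matrix(Q, K, V))
print("Loop and matrix versions agree:", same)
```

The shapes here differ on purpose: the number of Queries need not equal the number of Keys, and the Value dimension need not equal the Key dimension. Only the number of Keys and Values, and the Query and Key dimensions, have to match.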
