Transformer Model Explained
The self-attention architecture that revolutionized AI by processing entire sequences in parallel, powering LLMs, search, and generative applications.
The transformer is a neural network architecture based on self-attention mechanisms that processes entire sequences in parallel, enabling breakthrough performance in NLP, computer vision, and generative AI.
Explanation
Introduced in the 2017 paper "Attention Is All You Need," transformers replaced RNNs as the dominant sequence model. The key innovation is self-attention: each element in a sequence attends to every other element simultaneously, capturing long-range dependencies without step-by-step recurrence. This parallelism enables much faster training on GPUs. A transformer is built from encoder blocks (for understanding input), decoder blocks (for generating output), or both, as in the original machine-translation model. BERT uses the encoder for understanding; GPT uses the decoder for generation. Transformers have since expanded beyond NLP to vision (ViT), audio (Whisper), and multi-modal (CLIP) applications.
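To make the mechanism concrete, here is a minimal NumPy sketch of scaled dot-product self-attention for a single sequence and a single head. All sizes and the random projection weights are illustrative placeholders; real models learn these matrices and stack many heads and layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence (one head)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project inputs to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len): every token vs. every token
    weights = softmax(scores, axis=-1)        # each row is a distribution over the sequence
    return weights @ V                        # weighted sum of value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8              # toy sizes, chosen only for illustration
X = rng.normal(size=(seq_len, d_model))       # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 8): one output vector per token
```

Real implementations add multiple heads, masking, and positional information, but the core is just these few matrix products, which is why an entire sequence can be processed in one parallel pass.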
Bookuvai Implementation
Bookuvai uses transformer-based models for virtually all AI features. We use encoder models (BERT) for classification and search, decoder models (GPT) for text generation, and multi-modal models (CLIP) for image-text applications. We access these through APIs or deploy fine-tuned versions for specific use cases.
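As a hedged illustration of the encoder-for-search pattern, the sketch below uses the open-source Hugging Face transformers library with public checkpoints. The model name and the mean-pooling choice are stand-ins for illustration, not Bookuvai's actual stack.

```python
# Illustrative sketch only: this public checkpoint stands in for the
# fine-tuned models described above and does not reflect Bookuvai's
# actual deployments.
import numpy as np
from transformers import pipeline

# Encoder model (BERT-family): turn text into an embedding for semantic search.
encoder = pipeline("feature-extraction", model="distilbert-base-uncased")

def embedding(text: str) -> np.ndarray:
    tokens = np.array(encoder(text)[0])  # (num_tokens, hidden_size)
    return tokens.mean(axis=0)           # mean-pool into one document vector

docs = ["How to fine-tune a transformer", "Best pasta recipes"]
doc_vecs = [embedding(d) for d in docs]
q = embedding("adapting a pretrained language model")

# Rank documents by cosine similarity to the query.
sims = [q @ v / (np.linalg.norm(q) * np.linalg.norm(v)) for v in doc_vecs]
print(docs[int(np.argmax(sims))])  # expected: the fine-tuning doc
```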
Key Facts
- Self-attention enables parallel processing of entire sequences
- Replaced RNNs as the dominant architecture for sequence tasks
- Encoder (BERT) for understanding, decoder (GPT) for generation
- Expanded beyond NLP to vision, audio, and multi-modal applications
- Foundation of modern large language models
Frequently Asked Questions
- What is self-attention?
- Self-attention lets each element in a sequence compute a weighted sum over every element in the sequence (itself included), with the weights determining which parts of the input are most relevant. This captures long-range dependencies without sequential processing and is the key innovation that makes transformers powerful; the standard formulation is shown below.
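In symbols, this is the scaled dot-product attention from "Attention Is All You Need," where Q, K, and V are learned linear projections of the input and d_k is the key dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```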
- What is the difference between BERT and GPT?
- BERT uses the transformer encoder and is trained to understand text by filling in masked-out words. GPT uses the transformer decoder and is trained to generate text by predicting the next word. BERT excels at classification and search; GPT excels at text generation and conversation. The sketch below contrasts the two objectives.
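A quick way to feel the difference is to run the two training objectives side by side. This sketch uses the public bert-base-uncased and gpt2 checkpoints from the Hugging Face transformers library as convenient stand-ins; any masked-LM and causal-LM pair would do.

```python
from transformers import pipeline

# BERT-style objective: fill in a masked word using context from both sides.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The capital of France is [MASK].")[0]["token_str"])

# GPT-style objective: continue the text by predicting one word at a time,
# using left context only.
generate = pipeline("text-generation", model="gpt2")
print(generate("The capital of France is", max_new_tokens=5)[0]["generated_text"])
```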
- Why are transformers so computationally expensive?
- Self-attention computes a score for every pair of elements, so compute and memory grow quadratically with sequence length: doubling the length roughly quadruples the cost. Long sequences therefore require enormous memory and compute. Techniques like sparse attention and FlashAttention reduce this cost; the back-of-the-envelope estimate below shows how fast the naive cost grows.
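To make the quadratic growth concrete, here is a rough estimate of the attention score matrix alone, assuming fp16 scores for one head in one layer; real models multiply this by their head and layer counts.

```python
# Attention stores one score per token pair, so the score matrix has n**2
# entries. At fp16 (2 bytes per entry), per head, per layer:
for n in (1_024, 8_192, 65_536):
    mib = n * n * 2 / 2**20
    print(f"{n:>6} tokens -> {mib:>8,.0f} MiB")
# 1024 -> 2 MiB; 8192 -> 128 MiB; 65536 -> 8192 MiB.
# An 8x longer sequence costs 64x the memory, with similar growth in compute.
```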