We use cross-entropy loss to measure the difference between the model's predicted probability distribution and the actual next token (which is represented as a one-hot vector). The goal of training is to minimize this loss.
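
Because the target is a one-hot vector, the cross-entropy between it and the predicted distribution reduces to the negative log of the probability assigned to the correct token. The sketch below illustrates this in plain Python; the function names and the example logits are assumptions made for illustration, not taken from any particular implementation.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    # With a one-hot target, only the correct token's term survives:
    # loss = -log(p[target_index]).
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical logits over a 3-token vocabulary.
logits = [2.0, 1.0, 0.1]
loss_when_target_is_likely = cross_entropy(logits, 0)
loss_when_target_is_unlikely = cross_entropy(logits, 2)
```

The loss is small when the model puts high probability on the true next token and grows as that probability shrinks, which is why minimizing it pushes the model toward the correct prediction.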

If you truly want "from scratch," you cannot use Hugging Face’s tokenizers library for this step: you must parse the UTF-8 bytes and build the frequency map manually. A good PDF provides the Python loops for this and handles edge cases such as Unicode emoji (😊 splits into the four bytes \xf0\x9f\x98\x8a).
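
As a rough sketch of what such a loop might look like, the snippet below encodes a string to UTF-8 bytes and counts adjacent byte pairs with a plain dictionary. It assumes the frequency map is over adjacent byte pairs, as in byte-level BPE; the function name and the sample text are hypothetical.

```python
def byte_pair_counts(text):
    # Encode to raw UTF-8 bytes; each byte becomes an integer in 0..255.
    ids = list(text.encode("utf-8"))
    counts = {}
    # Walk over every adjacent pair of bytes and tally it by hand.
    for left, right in zip(ids, ids[1:]):
        pair = (left, right)
        counts[pair] = counts.get(pair, 0) + 1
    return ids, counts

# The emoji is a single character but four bytes: f0 9f 98 8a.
emoji_bytes = list("😊".encode("utf-8"))

ids, counts = byte_pair_counts("hi 😊😊")
most_frequent_pair = max(counts, key=counts.get)
```

Working on bytes rather than characters means every possible input is covered by a base vocabulary of 256 symbols, at the cost of multi-byte characters such as emoji starting out as several tokens.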

att_scores = (Q @ K.transpose(-2, -1)) / (self.d_head ** 0.5)
att_scores = att_scores.masked_fill(self.mask[:,:,:T,:T] == 0, float('-inf'))
att_weights = F.softmax(att_scores, dim=-1)
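
To see why the masked positions are filled with negative infinity, the plain-Python sketch below reproduces the mask-then-softmax step on a small score matrix, without PyTorch. It assumes a lower-triangular (causal) mask and scores that have already been scaled by the square root of the head dimension; the function name is hypothetical.

```python
import math

def causal_attention_weights(scores):
    # scores: T x T list of lists, already divided by sqrt(d_head).
    T = len(scores)
    weights = []
    for i in range(T):
        # Position i may only attend to positions j <= i; the rest get -inf.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(T)]
        # The diagonal entry is never masked, so the row max is always finite.
        m = max(masked)
        # exp(-inf) is exactly 0.0, so future positions receive zero weight.
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Uniform scores make the effect of the mask easy to read off.
w = causal_attention_weights([[0.0, 0.0, 0.0] for _ in range(3)])
```

With uniform scores, the first row puts all of its weight on the first position, the second row splits it evenly over the first two, and so on; every row still sums to 1 because the softmax normalizes over the unmasked entries only.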