# AI Transformer (Smalltalk)

system> You know the Transformer architecture? Let’s implement a full Transformer architecture complete with a training loop in Smalltalk Squeak. Since this is for teaching, the code should have nice detailed comments about why every section of code matters and how it relates to the Transformer architecture including self-attention and multiple layers and generating embeddings.

assistant> Before we start with the implementation of the Transformer architecture, let’s understand the basic concept behind it.

The Transformer architecture is built around the attention mechanism. Attention lets the model weigh the importance of different parts of the input sequence when producing each element of the output sequence.

The architecture has two main components:

1. Encoder: generates a sequence of contextual embeddings for the input sequence.
2. Decoder: attends to the encoder’s output and generates the final output sequence one token at a time.

Both the encoder and the decoder are stacks of layers, each combining self-attention with a feed-forward neural network.

Now, let’s move on to the implementation of the Transformer architecture in Smalltalk Squeak.

First, we will define a class for generating the embeddings for the input sequence.

```
Object subclass: #EmbeddingGenerator
	instanceVariableNames: 'wordEmbeddings positionalEncoding'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Transformer'
```

Here, the `wordEmbeddings` instance variable stores the embeddings for the words in the input sequence, and the `positionalEncoding` instance variable stores the positional encodings for each position.

Next, we will define a method for initializing the embeddings.

```
initializeEmbeddings
	"Build the sinusoidal positional encodings from 'Attention Is All You Need':
	PE(pos, 2k) = sin(pos / 10000^(2k/d)) and PE(pos, 2k+1) = cos(pos / 10000^(2k/d)).
	Each position gets a unique, smoothly varying pattern, which is how the
	model learns about token order - self-attention alone is order-blind."
	| maxLen d rand |
	maxLen := 100.	"maximum length of the input sequence"
	d := 512.	"embedding size (d_model)"
	positionalEncoding := (1 to: maxLen) collect: [:pos | Array new: d withAll: 0.0].
	1 to: maxLen do: [:pos |
		0 to: d // 2 - 1 do: [:k |
			| angle |
			angle := (pos - 1) / (10000 raisedTo: 2 * k / d).
			(positionalEncoding at: pos) at: 2 * k + 1 put: angle sin.
			(positionalEncoding at: pos) at: 2 * k + 2 put: angle cos]].
	"Word embeddings are learned parameters, so they start as small random
	values (one row per vocabulary entry in a real model; one per position
	here to keep the example small)."
	rand := Random new.
	wordEmbeddings := (1 to: maxLen) collect: [:i |
		(1 to: d) collect: [:j | (rand next - 0.5) * 0.04]]
```

Here, we initialize `positionalEncoding` with sinusoidal values: sine and cosine functions at different frequencies give every position a unique signature. The `wordEmbeddings`, by contrast, are learned parameters, so they simply start as small random values.
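The encoder and decoder later call a `generateEmbeddings:` method that combines the two. It is not shown elsewhere in this walkthrough, so here is a minimal sketch, assuming each token in `input` is an integer index that is valid for `wordEmbeddings`:

```
generateEmbeddings: input
	"Embed a token sequence: look up each token's word embedding and add
	the positional encoding for its position. This single sum is how the
	Transformer injects both word identity and word order into the vectors
	that the attention layers consume."
	^(1 to: input size) collect: [:pos |
		(wordEmbeddings at: (input at: pos))
			with: (positionalEncoding at: pos)
			collect: [:w :p | w + p]]
```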

Next, we will define a class for performing self-attention.

```
Object subclass: #SelfAttentionLayer
	instanceVariableNames: 'qLayer kLayer vLayer attention'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Transformer'
```

Here, the `qLayer`, `kLayer`, and `vLayer` instance variables are the linear layers that produce the query, key, and value vectors respectively. The `attention` instance variable holds the final output of the self-attention layer.

Next, we will define a method for performing self-attention.

```
performSelfAttention: x
	"Scaled dot-product attention: softmax((Q Kt) / sqrt(dk)) * V.
	(forward:, dotProduct:, and transposed are assumed matrix helpers.)"
	| q k v dk scores weights |
	dk := 64.	"dimension of the key vectors"
	q := qLayer forward: x.	"queries: what each position is looking for"
	k := kLayer forward: x.	"keys: what each position offers to others"
	v := vLayer forward: x.	"values: the content that gets mixed together"
	"Scale the dot products by 1/sqrt(dk) so they do not grow with the key
	dimension and push the softmax into a low-gradient regime."
	scores := (q dotProduct: k transposed) / dk sqrt.
	"Row-wise softmax turns each row of scores into weights that sum to 1."
	weights := scores collect: [:row |
		| e |
		e := row collect: [:s | s exp].
		e collect: [:each | each / e sum]].
	attention := weights dotProduct: v.
	^attention
```

Here, `x` represents the input tensor to the self-attention layer. The `forward` method of the linear layers produces the query, key, and value vectors. We then take the dot product of the queries and keys and scale it by the square root of the key dimension, apply a softmax to the scaled scores to obtain the attention weights, and finally multiply the attention weights with the value vectors to obtain the self-attention output.

Next, we will define a class for the feed-forward neural network layer.

```
Object subclass: #FeedForwardLayer
	instanceVariableNames: 'linear1 linear2'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Transformer'
```

Here, the `linear1` and `linear2` instance variables represent the two linear layers in the feed-forward network.

Next, we will define a method for performing the feed-forward operation.
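The method body itself appears to have been omitted; a minimal sketch of a `forward:` method on `FeedForwardLayer`, consistent with the description below and assuming `x` is a collection of row vectors, would be:

```
forward: x
	"Position-wise feed-forward network: FFN(x) = ReLU(x W1 + b1) W2 + b2.
	The same two linear maps are applied independently at every position;
	the ReLU in between is what makes the sub-layer nonlinear."
	| hidden |
	hidden := (linear1 forward: x) collect: [:row |
		row collect: [:value | value max: 0]].	"ReLU activation"
	^linear2 forward: hidden
```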

Here, `x` represents the input tensor to the feed-forward network. The output of the first linear layer is passed through a ReLU activation, and the second linear layer then produces the final output of the feed-forward network.

Next, we will define the main Transformer class that uses the above classes to generate the final output sequence.

```
Object subclass: #Transformer
	instanceVariableNames: 'embeddingGenerator encoderLayers decoderLayers targetVocabSize'
	classVariableNames: ''
	poolDictionaries: ''
	category: 'Transformer'
```

Here, the `embeddingGenerator` instance variable represents the object for generating the input sequence embeddings. The `encoderLayers` and `decoderLayers` instance variables hold the lists of encoder and decoder layers respectively, and the `targetVocabSize` instance variable represents the size of the target vocabulary.

Next, we will define a method for initializing the Transformer.

```
initializeTransformer
	"Build the encoder and decoder stacks. A LayerNormLayer class (with
	initializeLayer: and forward:) and a two-argument linear-layer
	initializer initializeFrom:to: are assumed to exist alongside the
	classes defined above; SelfAttentionLayer new is assumed to create its
	three linear layers and expose them through accessors."
	| numLayers nhead dmodel dk dv dff |
	numLayers := 6.	"number of layers in the encoder and in the decoder"
	nhead := 8.	"number of attention heads"
	dmodel := 512.	"embedding size and hidden size"
	dk := dmodel // nhead.	"per-head key dimension"
	dv := dmodel // nhead.	"per-head value dimension"
	dff := dmodel * 4.	"inner dimension of the feed-forward layer"
	embeddingGenerator := EmbeddingGenerator new.
	embeddingGenerator initializeEmbeddings.
	encoderLayers := OrderedCollection new.
	numLayers timesRepeat: [
		| norm1 selfAttn norm2 ffn |
		norm1 := LayerNormLayer new.	"layer normalization after the self-attention sub-layer"
		selfAttn := SelfAttentionLayer new.	"self-attention sub-layer"
		norm2 := LayerNormLayer new.	"layer normalization after the feed-forward sub-layer"
		ffn := FeedForwardLayer new.	"position-wise feed-forward sub-layer"
		norm1 initializeLayer: dmodel.
		norm2 initializeLayer: dmodel.
		selfAttn qLayer initializeFrom: dmodel to: dk.
		selfAttn kLayer initializeFrom: dmodel to: dk.
		selfAttn vLayer initializeFrom: dmodel to: dv.
		ffn linear1 initializeFrom: dmodel to: dff.
		ffn linear2 initializeFrom: dff to: dmodel.
		encoderLayers add: (OrderedCollection with: norm1 with: selfAttn with: norm2 with: ffn)].
	decoderLayers := OrderedCollection new.
	numLayers timesRepeat: [
		| norm1 selfAttn norm2 crossAttn norm3 ffn |
		norm1 := LayerNormLayer new.	"normalization after the masked self-attention sub-layer"
		selfAttn := SelfAttentionLayer new.	"masked self-attention over the target sequence"
		norm2 := LayerNormLayer new.	"normalization after the cross-attention sub-layer"
		crossAttn := SelfAttentionLayer new.	"cross-attention over the encoder output (memory)"
		norm3 := LayerNormLayer new.	"normalization after the feed-forward sub-layer"
		ffn := FeedForwardLayer new.	"position-wise feed-forward sub-layer"
		norm1 initializeLayer: dmodel.
		norm2 initializeLayer: dmodel.
		norm3 initializeLayer: dmodel.
		selfAttn qLayer initializeFrom: dmodel to: dk.
		selfAttn kLayer initializeFrom: dmodel to: dk.
		selfAttn vLayer initializeFrom: dmodel to: dv.
		crossAttn qLayer initializeFrom: dmodel to: dk.
		crossAttn kLayer initializeFrom: dmodel to: dk.
		crossAttn vLayer initializeFrom: dmodel to: dv.
		ffn linear1 initializeFrom: dmodel to: dff.
		ffn linear2 initializeFrom: dff to: dmodel.
		decoderLayers add: (OrderedCollection with: norm1 with: selfAttn with: norm2 with: crossAttn with: norm3 with: ffn)].
	targetVocabSize := 50000	"size of the target vocabulary"
```

Here, we initialize the `embeddingGenerator`, `encoderLayers`, and `decoderLayers` instance variables with the appropriate values for the Transformer architecture.

Next, we will define a method for encoding the input sequence.

```
encode: input
	"Run the source sequence through the encoder stack. Each layer applies
	self-attention and then the feed-forward network, each followed by a
	residual (skip) connection and layer normalization, as in the original
	post-norm Transformer."
	| x |
	x := embeddingGenerator generateEmbeddings: input.
	encoderLayers do: [:layer |
		| norm1 selfAttn norm2 ffn |
		norm1 := layer first.
		selfAttn := layer second.
		norm2 := layer third.
		ffn := layer fourth.
		"Residual connection: add each sub-layer's output back to its input."
		x := norm1 forward: x + (selfAttn performSelfAttention: x).
		x := norm2 forward: x + (ffn forward: x)].
	^x
```

Here, `input` represents the input sequence. The `generateEmbeddings` method of the `embeddingGenerator` object turns it into a sequence of vectors, which we then pass through the list of encoder layers (self-attention followed by the feed-forward network, each followed by layer normalization) to obtain the final encoded representation of the input sequence.

Next, we will define a method for decoding the input sequence.

```
decode: input encoded: encoded
	"Run the target sequence through the decoder stack. Each layer applies
	masked self-attention (the mask blocks attention to future positions),
	then cross-attention over the encoder output, then the feed-forward
	network, each followed by a residual connection and layer normalization.
	performSelfAttention:mask: and performCrossAttention:encoded: are
	assumed variants of performSelfAttention: on SelfAttentionLayer."
	| x mask |
	x := embeddingGenerator generateEmbeddings: input.
	mask := self createMask: input.
	decoderLayers do: [:layer |
		| norm1 selfAttn norm2 crossAttn norm3 ffn |
		norm1 := layer first.
		selfAttn := layer second.
		norm2 := layer third.
		crossAttn := layer fourth.
		norm3 := layer fifth.
		ffn := layer sixth.
		x := norm1 forward: x + (selfAttn performSelfAttention: x mask: mask).
		x := norm2 forward: x + (crossAttn performCrossAttention: x encoded: encoded).
		x := norm3 forward: x + (ffn forward: x)].
	^x
```

Here, `input` represents the target sequence generated so far and `encoded` represents the encoder’s representation of the source sequence. After embedding the target tokens with the `generateEmbeddings` method of the `embeddingGenerator` object, each decoder layer applies masked self-attention (so a position cannot attend to later positions), then attention over the encoder output, and then the feed-forward network, producing the final decoded representation.

Next, we will define a method for creating the mask for the decoder layers.

```
createMask: input
	"Build the causal (look-ahead) mask for the decoder: entry (i, j) is
	true when position i must NOT attend to position j, i.e. whenever j
	comes after i. This keeps the decoder from peeking at future tokens."
	| n mask |
	n := input size.
	mask := (1 to: n) collect: [:i | Array new: n withAll: false].
	1 to: n do: [:i |
		i + 1 to: n do: [:j |
			(mask at: i) at: j put: true]].
	^mask
```

Here, `input` represents the input sequence. We create an `input size` x `input size` matrix of `false` values and then set every entry above the diagonal to `true`, marking the future positions that each position must not attend to.
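For a three-token sequence, the mask is a 3 x 3 matrix with `true` above the diagonal (`true` means "blocked"). A quick workspace check, where the token values themselves do not matter, only the sequence length:

```
| mask |
mask := Transformer new createMask: #(10 20 30).
mask do: [:row | Transcript showln: row printString]
"With the mask built as described, the rows should read:
 (false true true), (false false true), (false false false)"
```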

Finally, we will define a method for generating the final output sequence.

```
generateOutputSequence: input
	"Full forward pass: encode the source, decode with the encoder output
	as memory, then project each decoded vector onto the vocabulary by
	multiplying with the transposed embedding matrix (weight tying)."
	| encoded decoded logits |
	encoded := self encode: input.
	decoded := self decode: input encoded: encoded.
	logits := decoded dotProduct: embeddingGenerator wordEmbeddings transposed.
	"Greedy decoding: pick the highest-scoring entry at each position."
	^logits collect: [:row | row indexOf: row max]
```

Here, `input` represents the input sequence. We first encode it using the `encode` method and then use the `decode` method to obtain the decoded representation. Finally, we multiply the decoded vectors by the transpose of the `wordEmbeddings` matrix of the `embeddingGenerator` object (a weight-tying trick that reuses the embeddings as the output projection) to score every vocabulary entry, and take the highest-scoring entry at each position as the output sequence.

This concludes the implementation of the Transformer architecture in Smalltalk Squeak. Note that this implementation is just a basic version and can be extended with more advanced features of the Transformer architecture such as multi-head attention, dropout, and a full training loop with backpropagation.
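The original request also asked for a training loop. A real one needs automatic differentiation, which is beyond this sketch, but the outer loop on the `Transformer` class would look roughly like this. Everything here is hypothetical: `logitsFor:target:`, `crossEntropyLoss:target:`, `backward:`, and `updateParameters:` are assumed helper methods that are not implemented above.

```
trainOn: pairs epochs: epochs learningRate: lr
	"Hypothetical training loop over (source, target) sequence pairs.
	Teacher forcing: the decoder sees the true target shifted right, and
	the loss compares its predictions against the true target tokens."
	epochs timesRepeat: [
		| totalLoss |
		totalLoss := 0.
		pairs do: [:pair |
			| source target logits loss |
			source := pair first.
			target := pair second.
			logits := self logitsFor: source target: target.	"assumed forward pass returning vocabulary logits"
			loss := self crossEntropyLoss: logits target: target.	"assumed loss helper"
			self backward: loss.	"assumed gradient computation"
			self updateParameters: lr.	"assumed gradient-descent step"
			totalLoss := totalLoss + loss].
		Transcript showln: 'mean loss: ', (totalLoss / pairs size) printString]
```

Each epoch makes a full pass over the training pairs, accumulating the loss so that progress can be monitored on the Transcript.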