LANGUAGE MODELS USING MLP (Part 2)

Welcome back! Today we will train our multi-layer perceptron and explore some techniques to fine-tune the model. Remember the terrifying loss we got in the previous chapter? Keep it in mind; we will try our best to minimize that loss in this post.

CROSS-ENTROPY

We first introduce a new way to compute our loss, called cross_entropy. Let's see what it actually gives us:

F.cross_entropy(logits,Y)
tensor(14.3920)

It is the same as our negative log-likelihood! In fact, cross-entropy here is just a compact implementation of our loss. Rather than hand-coding all of the intermediate steps, we wrap everything up in one line of code after calculating our logits:

# Calculating the logits
logits = h @ W2 + b2

# Our previous code
counts = logits.exp()
prob = counts / counts.sum(1,keepdims = True)
loss =  -prob[torch.arange(32),Y].log().mean()

# New version
F.cross_entropy(logits,Y)
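As a quick sanity check (a small sketch of my own, assuming the toy dataset with 32 examples and that F is torch.nn.functional as usual), you can confirm the two computations agree:

# The manual NLL and the built-in cross_entropy should match up to floating-point error
manual_loss = -prob[torch.arange(32), Y].log().mean()
builtin_loss = F.cross_entropy(logits, Y)
print(torch.allclose(manual_loss, builtin_loss))  # expect True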

The function cross_entropy actually offers convenience beyond that:

  1. By eliminating the intermediate steps, we also save a lot of memory. In our previous code we created several intermediate tensors to hold the counts, then the probabilities, then finally the loss, and that is wasteful. The cross_entropy function fuses these steps internally (at least that's what I was told), so there's a big memory saving.
  2. The function also allows for a more efficient backward pass. We don't have to backpropagate through numerous intermediate operations as in our previous code; instead we backpropagate through a single fused computation, which is faster.
  3. It is more numerically well-behaved. This is the point we really need to dive into, so let's spend a few minutes on what "numerically well-behaved" means.

Let's consider the probabilities computed from a sample logits tensor:

logits = torch.tensor([-13., -3., 0., 5.]) - 6
counts = logits.exp()
probs = counts / counts.sum()
probs
tensor([1.5123e-08, 3.3311e-04, 6.6906e-03, 9.9298e-01])

So far so good, no problems here. But let's see what happens when we have a huge positive value in our tensor:

logits = torch.tensor([-13., -3., 0., 100.]) - 6
counts = logits.exp()
probs = counts / counts.sum()
probs
tensor([0., 0., 0., nan])

Something is wrong! Scrutinizing the counts further, we find that:

>>>counts
tensor([5.6028e-09, 1.2341e-04, 2.4788e-03,        inf])

The count for the value 100 goes to infinity! That is understandable: we take the exponential of each value, so when it encounters a huge positive number it tends to blow up, which is called numerical overflow (note that it behaves fine with negative numbers, which just shrink toward zero).

We don't want that bug to appear in our project, so we need a way to scale down all of the values. This relies on a subtle property of the softmax: the output doesn't change when we offset the logits by a constant. Take a moment to convince yourself of this; it is easy to show mathematically, because the constant becomes a common factor that cancels in the normalization.

With that in mind, the simplest fix is to subtract the maximum value of the tensor from every logit. We are then left with only non-positive numbers, which exponentiate safely.
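Here's a minimal sketch of that fix, reusing the problematic example from above (my own illustration, not part of the original code). Mathematically, e^(x + c) = e^c * e^x, and the common factor e^c cancels when we divide by the sum, so the probabilities are untouched:

logits = torch.tensor([-13., -3., 0., 100.]) - 6
counts = (logits - logits.max()).exp()  # the largest exponent is now exactly 0, so nothing overflows
probs = counts / counts.sum()
probs  # finite values, with essentially all of the mass on the last entry -- no inf, no nan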

Turning back to cross_entropy: this exact operation is built into the function, so it really does wonders for us in terms of both convenience and numerical stability.

Now we shall turn back to our project.

Putting Things Together

During the course of implementing our project, we have been initializing tensors and writing functions in a stream-of-consciousness way, so our code is rather messy and unorganized. In practice, organizing the project is crucial for readability and for debugging later on. Moreover, when we want to modify the model, we will know exactly where to go. So if you're working on your own project, don't be lazy: take the time to put things together, it will pay off immensely later on.

Initializing all the parameters

We put all of the matrices that we have to initialize in a single block, and we also define the seed for the Generator, so that we can reproduce our code.

g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((27, 2), generator=g)    # embedding table: 27 characters, 2 dims each
W1 = torch.randn((6, 100), generator=g)  # hidden layer: 3 context chars * 2 dims = 6 inputs, 100 neurons
b1 = torch.randn(100, generator=g)
W2 = torch.randn((100, 27), generator=g) # output layer: 100 hidden units -> 27 logits
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]
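A quick check I like to run at this point (my own addition, not in the original code) is counting the parameters, so we know how big the model actually is:

# Total number of scalar parameters: 27*2 + 6*100 + 100 + 100*27 + 27 = 3481
sum(p.nelement() for p in parameters)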

Now we shall build our model

Build the model

First, remember to set requires_grad = True so that we can implement our backward pass.

for p in parameters:
  p.requires_grad = True

Then we build the training loop with a forward pass, a backward pass, and a parameter update:

for i in range(100):

  # forward pass
  emb = C[X] # (32, 3, 2)
  h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
  logits = h @ W2 + b2 # (32, 27)
  loss = F.cross_entropy(logits, Y)
  print(loss.item())

  # backward pass
  for p in parameters:
    p.grad = None
  loss.backward()

  # update
  for p in parameters:
    p.data += -0.1 * p.grad

Run the code

So now we shall run the code and see what happens. Here are the last few results:

0.26593196392059326
0.26574623584747314
0.265565425157547
0.26538926362991333
0.2652176320552826
0.2650502920150757

Astonishing? Not really. Remember that the data is tiny (only 32 examples), so when the model reaches an extremely small loss we shouldn't praise it; we should suspect that the model is overfitting.

Overfitting means that the model memorizes the data rather than capturing the underlying pattern. So even though the model looks extremely good on its own training set, it performs poorly on real-world data, where it sees things it hasn't learned before and can't do anything useful with them, because everything it has done so far is pure memorization (think of a student who memorizes the homework answers and then faces a test with new questions).

But why doesn't the loss converge to 0? If the model memorizes everything in the training set, why doesn't it reach 100% accuracy? Normally, when a model overfits this hard, we would expect a loss of essentially zero. In this project, however, there is a subtle detail that prevents the network from guessing perfectly.

Let's look at our data again:

... ---> e
..e ---> m
.em ---> m
emm ---> a
mma ---> .
... ---> o
..o ---> l
.ol ---> i
oli ---> v
liv ---> i
ivi ---> a
via ---> .
... ---> a
..a ---> v
.av ---> a
ava ---> .

Can you notice the key here? Take a guess.

So here's the answer: look at the instances with context ..., denoting the start of a word. You simply cannot predict the character that comes after it, because the same context appears several times with different targets. That's where our model struggles.
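If you want to verify this (a small sketch of my own, assuming the toy X and Y built from the first few words are still in memory), you can collect every target that follows the all-dots context:

# Context [0, 0, 0] is '...': it maps to several different next characters
start_targets = [itos[Y[i].item()] for i in range(len(Y)) if X[i].tolist() == [0, 0, 0]]
print(start_targets)  # e.g. ['e', 'o', 'a', ...] -- one context, many answers, so the loss can't hit 0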

Working with the whole dataset

So far we've been dealing with only a few words from the dataset, so let's go big this time.

From the code that constructs our dataset, we remove the words[:5] slice and pass in everything.

# rebuild the dataset
block_size = 3 
X, Y = [], []

# Now we iterate through all the words
for w in words:
  context = [0] * block_size
  for ch in w + '.':
    ix = stoi[ch]
    X.append(context)
    Y.append(ix)
    #print(''.join(itos[i] for i in context), '--->', itos[ix])
    context = context[1:] + [ix] # crop and append

X = torch.tensor(X)
Y = torch.tensor(Y)
print(X.shape, Y.shape)

Now let's look at the size of our data

torch.Size([228146, 3]) torch.Size([228146])

That's huge.

Also, when calculating the loss, make sure to remove the hard-coded number 32 and replace it with the appropriate dimension. This is exactly why we try to avoid hard-coding numbers: it saves a lot of time chasing down minor bugs later.

loss =  -prob[torch.arange(X.shape[0]),Y].log().mean()

Now we should run our model and see what happens. I recommend setting the number of iterations to 10; we will discuss why shortly.
Here are the results:

19.505226135253906
17.08449363708496
15.776531219482422
14.833340644836426
14.002603530883789
13.253260612487793
12.57991886138916
11.983101844787598
11.47049331665039
11.051856994628906

When running the code, it turns out that the model becomes much slower, and that's understandable: it has to calculate the gradient over the whole dataset of roughly 228,000 examples on every iteration, which is a big deal. We're only doing 10 iterations here, so this approach is really questionable at scale.

A clever way to combat this is to pass the input in mini-batches rather than the whole dataset at once, so we deal with a small group of examples at a time instead of every single one. The gradient from a mini-batch is, of course, only an approximation of the full-batch gradient, but it is much faster and puts far less strain on the computer, and it is what's used in practice. In other words, it's much better to take many steps with an approximate gradient than a few steps with the exact gradient.

So let's construct the mini-batch: we randomly sample a tensor of 32 indices ranging from zero to the number of training examples. Here's how it goes:

torch.randint(0,X.shape[0],(32,))
tensor([200670, 191458, 142413, 156993, 217108, 174176, 143298,  30653, 148878,
        158381,  11828,  75183, 115824,  49455,  91737, 216958, 142564, 224086,
         73948, 217108, 174951, 170926, 180371, 224631, 167595, 173195, 116182,
        192239, 158702,  43879,  45633, 165950])

That looks tasty. Let's include that in our code and modify the model a little so that each iteration takes exactly those 32 examples, no more, no less.

for p in parameters:
  p.requires_grad = True
for i in range(100):
  # mini batch construct
  ix = torch.randint(0,X.shape[0],(32,))
  # forward pass
  emb = C[X[ix]] # (32, 3, 2)
  h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
  logits = h @ W2 + b2 # (32, 27)
  loss = F.cross_entropy(logits, Y[ix])
  print(loss.item())

  # backward pass
  for p in parameters:
    p.grad = None
  loss.backward()

  # update
  for p in parameters:
    p.data += -0.1 * p.grad

Here are the last few losses:

3.3569700717926025
3.668323516845703
3.5134356021881104
3.7804458141326904

Promising! And it takes just a few seconds to arrive at that loss.

Learning Rate

Let's look at the implementation again; there's actually one thing we haven't talked about yet: the number 0.1 we multiply by when updating the parameters. It is called the learning rate, and we will spend this section discussing it.

The learning rate, as you can probably guess, is simply a number that controls the size of our update step. If the learning rate is too low, meaning tiny steps, it takes a huge number of iterations to converge to the minimum. By contrast, if the learning rate is too high, meaning big steps, we can overshoot the minimum and end up bouncing back and forth, never really converging and possibly diverging. If that's hard to visualize, here's a nice visualization that might help:

[Image: effect of the learning rate on convergence]

*Note: The image is from a really dedicated blog post about learning rates (100x better than mine); I recommend reading it, as it dives deep into the topic. Here's the link.
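If you prefer numbers over pictures, here's a tiny illustration of my own (a 1-D parabola, not our model) showing what a too-small, reasonable, and too-large learning rate do to gradient descent:

# Minimize f(x) = x^2 (gradient is 2x), starting from x = 5.0
for lr in [0.01, 0.1, 1.1]:
    x = 5.0
    for _ in range(20):
        x = x - lr * 2 * x
    print(f"lr={lr}: x ends at {x:.4f}")
# lr=0.01 -> still around 3.3 (too slow), lr=0.1 -> near 0 (converges), lr=1.1 -> past 190 (diverges)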

Learning rate decay

There are many ways to schedule the learning rate during training in order to get the most out of the model. The simplest one, which we will use in our project, is step decay (also just called learning rate decay). The technique is to decrease the learning rate after a fixed number of iterations, so the model moves more slowly once the loss is small and we are potentially near the minimum. This is intuitive: take big steps while far from the minimum, and smaller ones when close.

Even though there are fancier schedules for step decay, we will keep it simple and just divide the learning rate by 10 after a chunk of about 10,000 iterations.
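Inside the training loop's update step, that could look like the following sketch (the exact threshold and rates are choices, not fixed rules):

# Step decay sketch: a big learning rate early on, then 10x smaller for the late iterations
decay_point = 10000          # assumed threshold -- pick it based on how long you train
lr = 0.1 if i < decay_point else 0.01
for p in parameters:
  p.data += -lr * p.grad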

Find the right learning rate

Now we know the importance of choosing the right learning rate and adjusting it so that the model performs well. But how exactly do we find it?

We will follow Andrej's way of finding a good learning rate. It's not the most sophisticated method, but it's intuitive and easy to follow. First we determine a range of plausible values: there are points where the learning rate is too low and the model barely decreases the loss, and points where it is too high and the loss bounces around. We find those boundaries by plugging in different values, pure trial and error. After playing around with a few learning rates (Andrej does this in his video, not me), we have our range: from 0.001 to 1.

The next step is to sweep through values in that range and see what happens to the loss; the best value should be where the loss is at its lowest. So let's run the model with different learning rates. We can use PyTorch to construct an array of values in that range:

# Want to examine the range from 0.001 to 1
lrs = torch.linspace(0.001, 1, 1000)

Plotting the lr-loss Curve

After building the array we want, let's plug it into our model. We also need two lists, storing the learning rates and the losses respectively, so we can plot one against the other.
A BIG NOTE: after messing around in my own project, I have to warn you so you don't make the same silly mistake: REMEMBER TO RERUN THE ENTIRE MODEL AGAIN. Don't reuse the previous weights and biases; you have to INITIALIZE THEM ALL OVER AGAIN. Keep this in mind every time we rerun the model.
Here's the code:

# Keep track of lr and loss for each iter
lri = []
lossi = []

for p in parameters:
  p.requires_grad = True
for i in range(1000):
  # mini batch construct
  ix = torch.randint(0,X.shape[0],(32,))
  # forward pass
  emb = C[X[ix]] # (32, 3, 2)
  h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
  logits = h @ W2 + b2 # (32, 27)
  loss = F.cross_entropy(logits, Y[ix])
  print(loss.item())

  # backward pass
  for p in parameters:
    p.grad = None
  loss.backward()

  # update
  lr = lrs[i]
  for p in parameters:
    p.data += -lr * p.grad
  # tracking stats
  lri.append(lr)
  lossi.append(loss.item())

Here comes the plot:

# Plot the lr
plt.plot(lri,lossi)

[Plot: loss vs. learning rate]
This is a hockey-stick-like graph: a steep drop near the origin (almost a straight line down), then small fluctuations through the middle, before bouncing back and forth uncontrollably toward the end of the range.

Let's interpret this graph, and you'll see that our earlier conclusions hold up. First, at the beginning of the range, where the learning rate is small, the loss doesn't get anywhere near its minimum, indicating that the steps are too small to converge within our iteration budget. Conversely, over the rest of the range, as the learning rate increases, the loss bounces between values, and it bounces even more violently as the learning rate approaches 1, which is an indicator of divergence.

Now our mission is to read off the learning rate where the loss is at its lowest. By eye, that value is somewhere around 0.1. If it's not obvious enough, we can plot against the exponent (log) of the learning rate instead, which spreads the interesting region out. Let's implement that in our code:

# Take the log of our range of values

lre = torch.linspace(-3,0,1000)
lrs = 10 ** lre

# Initialize a new list
lrlog = []

# Update this list when tracking stats
....
  lrlog.append(lre[i])
  lri.append(lr)
  lossi.append(loss.item())

# Plotting things out
plt.plot(lrlog,lossi)

And here's the plot:

[Plot: loss vs. log learning rate]
Now it's obvious. And we've come full circle in this section: the magic number 0.1 from the beginning is actually the optimal learning rate for this model.

And after rerunning the model with this optimal learning rate, plus the learning rate decay technique, here is our loss:

2.3681914806365967

It's the best result we've gotten so far! It's amazing that this one small number can have such a huge effect on the model, even with the simple methods we're using (imagine how far you could go with more advanced schedulers; worth a try!).

Splitting the Data

So far we've had the model study a lot, and we've fine-tuned it to bring the loss down. But does this loss mean our model will do well in a real-world setting? We don't actually know, because there has been no test yet. How can we create a test for the model?

The answer is that we stop feeding the model every data point; instead, we hold back a portion of the dataset. That held-out portion becomes our test, because the model never sees it during training. Typically the split is around 8:2 or 7:3 for training vs. testing, and the evaluation metric on the test set is still our cross-entropy. One thing to keep in mind: if the validation loss is high while the training loss is low, there's a strong chance of overfitting; the model does well only on data it has seen before and poorly on the test, which indicates memorization rather than generalization.

Moreover, we also want to hold out another bit of data for fine-tuning the model. It's like another test set, but not for the final test: it's for evaluations while searching for good hyperparameters (just like what we did with the learning rate). "But why another set? Why not evaluate the hyperparameters right on the training set like we've been doing?" you may ask. The thing is, you are tweaking the hyperparameters so they make the model powerful on the training data specifically; there's no guarantee those hyperparameters will work in practice, on data the model has never seen. So we create another held-out set called the dev set, designed specifically for choosing hyperparameters.

A side note: what if the data is too small to split into three sets? How can we still evaluate the model reliably? The solution is quite neat: if the data is small, we "reuse" it for testing. Rather than making one fixed split and evaluating on that, we create several different splits of the same data, evaluate the model on each, and look at the overall picture. This clever method is called cross-validation, and it is widely used in ML.

Concretely, k-fold cross-validation first randomly splits the data into k folds; the model is then trained and scored k times, each time holding out a different fold for evaluation, giving k scores to average.
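Here's a minimal sketch of the k-fold idea in plain PyTorch (my own illustration; we won't need it in this project since our dataset is large):

# Split the example indices into k folds, hold one fold out each round
k = 5
perm = torch.randperm(X.shape[0])
folds = perm.chunk(k)
for i in range(k):
  test_idx = folds[i]
  train_idx = torch.cat([folds[j] for j in range(k) if j != i])
  # train on X[train_idx], Y[train_idx]; evaluate on X[test_idx], Y[test_idx]
  # then average the k evaluation scores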

Split the data

Just to remind you, here's a brief recap of our three sets:

  • Training set: This is for training the parameters, or training the model.
  • Dev set: This is for tuning the hyperparameters
  • Test set: This is for the final evaluation of the model

Now let's implement the split in our code: we first shuffle the data, then compute the cutoff indices for each set based on the split we want (in this project we use 8:1:1). Then we build the three datasets from those cutoffs. Here's how it looks:

block_size = 3 

# Create a function for the dataset construction 
# For convenience purposes
def build_dataset(words):
  X, Y = [], []
  for w in words:

    #print(w)
    context = [0] * block_size
    for ch in w + '.':
      ix = stoi[ch]
      X.append(context)
      Y.append(ix)
      #print(''.join(itos[i] for i in context), '--->', itos[ix])
      context = context[1:] + [ix] # crop and append

  X = torch.tensor(X)
  Y = torch.tensor(Y)
  print(X.shape, Y.shape)
  return X, Y

# Finding the cutoff indices for each set
import random
random.seed(42)
random.shuffle(words)
n1 = int(0.8*len(words))
n2 = int(0.9*len(words))

# Split
Xtr, Ytr = build_dataset(words[:n1])
Xdev, Ydev = build_dataset(words[n1:n2])
Xte, Yte = build_dataset(words[n2:])
torch.Size([182625, 3]) torch.Size([182625])
torch.Size([22655, 3]) torch.Size([22655])
torch.Size([22866, 3]) torch.Size([22866])

Those are our splits! Now we retrain the model, remembering to use only the data in the training set.

# Now we train on just the Xtr and Ytr
for i in range(30000):

  # minibatch construct
  ix = torch.randint(0, Xtr.shape[0], (32,))

  # forward pass
  emb = C[Xtr[ix]] # (32, 3, 2)
  h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (32, 100)
  logits = h @ W2 + b2 # (32, 27)
  loss = F.cross_entropy(logits, Ytr[ix])
  #print(loss.item())

  # backward pass
  for p in parameters:
    p.grad = None
  loss.backward()

  # update
  lr = 0.1
  for p in parameters:
    p.data += -lr * p.grad
print(loss.item())
2.3609843254089355

And now comes the evaluation; remember to use the dev set for this task:

emb = C[Xdev] # (22655, 3, 2)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (22655, 100)
logits = h @ W2 + b2 # (22655, 27)
loss = F.cross_entropy(logits, Ydev)
loss
tensor(2.4578, grad_fn=<NllLossBackward0>)

You can see that the validation loss is a bit higher than the training loss, which suggests the model is slightly overfitting. When a model overfits, we usually reach for regularization techniques; we actually used one in the previous project. But let's leave that aside for now; in the next section we will continue exploring ways to fine-tune the model.
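(Just for reference, since we won't add it here: the regularization we used in the previous project was an L2-style penalty on the weights. A rough sketch of what that could look like in this model's loss, with an assumed strength you'd tune on the dev set, is below.)

# Hypothetical L2 regularization added to the cross-entropy loss
reg = 0.01  # assumed regularization strength, not a tuned value
loss = F.cross_entropy(logits, Ytr[ix]) + reg * ((W1**2).mean() + (W2**2).mean())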

Fine-tuning the Model

How can we make the model more powerful? The first thing that comes to mind is perhaps: just scale it!

Scale the model

Let's modify the matrices we initialize: we increase the number of neurons in the hidden layer from 100 to 300. Here are the modified matrices:

g = torch.Generator().manual_seed(2147483647) 
C = torch.randn((27, 2), generator=g)
W1 = torch.randn((6, 300), generator=g)
b1 = torch.randn(300, generator=g)
W2 = torch.randn((300, 27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]

Now let's run our model again (remember to set requires_grad and re-initialize the parameters), but this time we keep track of the step count and the loss, to see how the loss changes over the iterations. We'll take the log of the loss for better visualization.

# Now we track the step, use the log for the loss
lossi = []
stepi = []
for i in range(30000):

  # minibatch construct
  ix = torch.randint(0, Xtr.shape[0], (32,))

  # forward pass
  emb = C[Xtr[ix]] 
  h = torch.tanh(emb.view(-1, 6) @ W1 + b1) 
  logits = h @ W2 + b2
  loss = F.cross_entropy(logits, Ytr[ix])
  #print(loss.item())

  # backward pass
  for p in parameters:
    p.grad = None
  loss.backward()

  # update
  #lr = lrs[i]
  lr = 0.1
  for p in parameters:
    p.data += -lr * p.grad

  # track stats

  stepi.append(i)
  lossi.append(loss.log().item())
plt.plot(stepi,lossi)

[Plot: log loss vs. training step]

You can see how the loss is quickly driven down and then fluctuates.

We should see our final loss:

2.799872875213623

It's worse, maybe because we held some data out (and note that this number is just the loss on the last mini-batch). Let's evaluate this model using the dev set.

# Evaluate using the dev set
emb = C[Xdev] # (22655, 3, 2)
h = torch.tanh(emb.view(-1, 6) @ W1 + b1) # (22655, 300)
logits = h @ W2 + b2 # (22655, 27)
loss = F.cross_entropy(logits, Ydev)
loss
tensor(2.5693, grad_fn=<NllLossBackward0>)

Not bad! The validation loss comes out lower than that last mini-batch training loss, so the model isn't doing badly, but compared with the 2.4578 we measured before scaling, it hasn't actually improved.

Changing the embedding size

We scaled the net and yet we did not get a really good result. So one thing that comes to mind is that maybe the bottleneck is not the size of the net, but the embedding size. Remember that we squeezed all 27 characters into just two dimensions, so there may be a lot of information being lost there.

But before we change the embedding size, let's take a look at how our model learned during the training phase. Let's visualize our embedding matrix C:

# visualize dimensions 0 and 1 of the embedding matrix C for all characters
plt.figure(figsize=(8,8))
plt.scatter(C[:,0].data, C[:,1].data, s=200)
for i in range(C.shape[0]):
    plt.text(C[i,0].item(), C[i,1].item(), itos[i], ha="center", va="center", color='white')
plt.grid('minor')

[Plot: 2-D character embeddings]

The model learns to cluster characters, and if you look closely, it even separates the vowels a, i, u, e, o, which is astonishing to see. And g sits far apart; maybe the model thinks g is not a very common character in names? Either way, it is amazing that something meaningful emerges in the network just from tweaking and tuning a bunch of numbers.

Now let's increase the embedding size to 10, which changes the C matrix and W1.

g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((27, 10), generator=g)
W1 = torch.randn((30, 300), generator=g)
b1 = torch.randn(300, generator=g)
W2 = torch.randn((300, 27), generator=g)
b2 = torch.randn(27, generator=g)
parameters = [C, W1, b1, W2, b2]

Adjust the view dimensions accordingly and rerun the code; here are our losses:

# training loss
loss.item()
2.395942211151123
# validation loss
emb = C[Xdev] # (22655, 3, 10)
h = torch.tanh(emb.view(-1, 30) @ W1 + b1) # (22655, 300)
logits = h @ W2 + b2 # (22655, 27)
loss = F.cross_entropy(logits, Ydev)
loss
tensor(2.4487, grad_fn=<NllLossBackward0>)
# test loss
emb = C[Xte] # (22866, 3, 10)
h = torch.tanh(emb.view(-1, 30) @ W1 + b1) # (22866, 300)
logits = h @ W2 + b2 # (22866, 27)
loss = F.cross_entropy(logits, Yte)
loss
tensor(2.4514, grad_fn=<NllLossBackward0>)

Those are better than the previous results! We did great, man.

Sampling from the model

Now we should see our babies. The sampling code is the same as in the previous project, so let's look at the results:

# sample from the model
g = torch.Generator().manual_seed(2147483647 + 10)

for _ in range(20):

    out = []
    context = [0] * block_size # initialize with all ...
    while True:
      emb = C[torch.tensor([context])] # (1,block_size,d)
      h = torch.tanh(emb.view(1, -1) @ W1 + b1)
      logits = h @ W2 + b2
      probs = F.softmax(logits, dim=1)
      ix = torch.multinomial(probs, num_samples=1, generator=g).item()
      context = context[1:] + [ix]
      out.append(ix)
      if ix == 0:
        break

    print(''.join(itos[i] for i in out))
carpah.
quelle.
khi.
mila.
tety.
salaysa.
jazhnen.
amerahtia.
qui.
nellana.
chaiiv.
kaneel.
hham.
pein.
quinn.
sron.
taivanbi.
watell.
dearisi.
fine.

It's way better! Even though some names are nonsense, a lot of them are starting to sound name-like, and that is a huge improvement for our model. We're moving forward.

Summarize

We've come a long way on this journey: we learned a lot about the MLP, manipulated data using tensors, and fine-tuned our model with a wide range of methods, including feeding data in mini-batches, tweaking the learning rate, splitting the data, scaling the model, and increasing the embedding size. That's a huge amount of knowledge!

Thanks for reading, see you in the next chapter!

