How Do LLMs Understand Language?

If you’ve ever wondered how ChatGPT, Claude, or Gemini understands human language, the answer lies in three foundational techniques: tokenization, vectorization, and embeddings.

These methods convert your words into mathematical representations that large language models (LLMs) can learn from, reason over, and use to generate responses.

In this article, we’ll break down each technique using analogies, math, and Python-ready examples.

What Is Tokenization?

Tokenization is the first step in natural language processing (NLP).

It splits a sentence like:

"Hello, how are you?"

Into parts like:

  • Whole words: "hello", "how", "are", "you"
  • Or subwords: "hel", "##lo" (used in models like BERT, where "##" marks a continuation of the previous token)

Why It Matters:

Tokenization gives structure to the input, allowing LLMs to focus on smaller, context-aware units.

Popular strategies include:

  • WordPiece (BERT)
  • Byte Pair Encoding (BPE) (GPT)
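
To make this concrete, here's a minimal sketch, assuming the Hugging Face transformers package is installed, that runs both strategies on our example sentence:

from transformers import AutoTokenizer

# Compare WordPiece (BERT) and Byte Pair Encoding (GPT-2).
for name in ("bert-base-uncased", "gpt2"):
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, tokenizer.tokenize("Hello, how are you?"))

# Expected output (details may vary by library version):
# bert-base-uncased ['hello', ',', 'how', 'are', 'you', '?']
# gpt2 ['Hello', ',', 'Ġhow', 'Ġare', 'Ġyou', '?']   (Ġ marks a leading space)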

What Is Vectorization?

Once tokenized, each token is converted into a vector—a numeric list that allows it to exist in a mathematical space.

  • A token might become [0.25, -0.14, 0.93, ...]
  • These vectors allow for math-based comparisons (e.g., similarity, analogy)
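
As a toy illustration, here's a sketch with hypothetical 4-dimensional vectors (real models use hundreds of dimensions) showing how similarity becomes plain arithmetic:

import numpy as np

# Made-up vectors for three tokens; the values are illustrative only.
hello = np.array([0.25, -0.14, 0.93, 0.41])
hi    = np.array([0.22, -0.10, 0.90, 0.38])
cat   = np.array([-0.70, 0.55, 0.05, -0.20])

def cosine(a, b):
    # Cosine similarity: 1 = same direction, 0 = unrelated, -1 = opposite.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(hello, hi))   # high: the vectors point the same way
print(cosine(hello, cat))  # low: the vectors diverge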

Example: Vector Arithmetic

If:

  • man → vector A
  • king → vector B
  • woman → vector C

Then:

king - man + woman ≈ queen

This isn’t just math—it’s semantic reasoning in action.
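
You can reproduce this classic analogy with pretrained static word vectors. Here's a sketch using gensim (pip install gensim; note that the pretrained model below is a large download on first run):

import gensim.downloader as api

# Load pretrained word2vec vectors (~1.6 GB on first use).
vectors = api.load("word2vec-google-news-300")

# king - man + woman ≈ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically prints something like [('queen', 0.71)]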

What Are Embeddings?

Embeddings are vector representations whose values are learned during training.

Unlike random initial vectors, embeddings:

  • Cluster similar words ("hello", "hi", "howdy")
  • Space out unrelated words ("cat" and "universe")
  • Are trained via gradient descent to represent meaning

Properties:

  • Typically 768–2048+ dimensions
  • Used as input for transformer blocks
  • Stored in lookup tables in frameworks like PyTorch or TensorFlow
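
In PyTorch, that lookup table is literally an nn.Embedding layer. A minimal sketch with illustrative sizes (768 is the hidden size of bert-base):

import torch
import torch.nn as nn

# A learnable lookup table: 10,000-token vocabulary, 768-dim vectors.
# The weights start random and are shaped into meaning by gradient descent.
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=768)

token_ids = torch.tensor([101, 7592, 102])  # ids a tokenizer might produce
vectors = embedding(token_ids)              # shape: (3, 768)
print(vectors.shape)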

Implementing in Python with Transformers

Let’s walk through this using the 🤗 Transformers library:

1. Install Requirements

pip install transformers torch scikit-learn

2. Tokenize and Get Embeddings

from transformers import AutoTokenizer, AutoModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentence = "Artificial intelligence is fascinating"
tokens = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**tokens)
    embeddings = outputs.last_hidden_state.mean(dim=1)  # Mean-pool token vectors into one sentence embedding

3. Measure Semantic Similarity

sentence2 = "AI is amazing"
tokens2 = tokenizer(sentence2, return_tensors="pt")
with torch.no_grad():
    outputs2 = model(**tokens2)
    embeddings2 = outputs2.last_hidden_state.mean(dim=1)

similarity = cosine_similarity(embeddings.numpy(), embeddings2.numpy())
print("Cosine Similarity:", similarity[0][0])

A result near 1 means the sentences are very similar; near 0 means they are unrelated (cosine similarity ranges from -1 to 1).

Real-World Applications

  • Semantic search: Retrieve results based on meaning, not keywords (see the sketch after this list)
  • Chatbots: Interpret similar queries despite phrasing
  • AI writing tools: Suggest coherent next sentences
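
Here's a minimal semantic-search sketch under the same mean-pooling assumption as above (dedicated sentence-embedding models such as sentence-transformers usually work better in practice):

from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    # Mean-pool BERT's last hidden states into one sentence vector.
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**tokens).last_hidden_state.mean(dim=1).numpy()

query = "How do neural networks learn?"
docs = [
    "Gradient descent updates a model's weights.",
    "Paris is the capital of France.",
    "Backpropagation is used to train neural networks.",
]

# Rank documents by cosine similarity to the query, best match first.
q = embed(query)
for doc in sorted(docs, key=lambda d: cosine_similarity(q, embed(d))[0][0], reverse=True):
    print(doc)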

Want to Learn More?

Try your own embeddings on topics you care about—animals, cities, programming languages—and visualize the results with t-SNE.
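
For example, a quick t-SNE sketch (assuming scikit-learn and matplotlib are installed, and reusing the embed() helper sketched above):

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

words = ["cat", "dog", "lion", "paris", "tokyo", "berlin"]
X = np.vstack([embed(w) for w in words])  # one embedding per word

# Project to 2D; perplexity must be smaller than the number of samples.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.show()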

Conclusion

Tokenization breaks your thoughts into chunks. Vectorization turns those chunks into numbers. Embeddings give those numbers meaning.

Understanding these foundations equips you to:

  • Build smarter AI applications
  • Customize fine-tuning strategies
  • Decode how LLMs think

Let’s keep demystifying the AI beneath the interface. 🧠💡

✍️ Written by: Cristian Sifuentes – Full-stack dev crafting scalable apps with [NET - Azure], [Angular - React], Git, SQL & extensions. Clean code, dark themes, atomic commits

#llm #embeddings #nlp #tokenization #huggingface

