This content originally appeared on DEV Community and was authored by Abdelrahman Adnan
Module Overview: Welcome to Module 2 of the LLM Zoomcamp! This chapter covers the theoretical foundations of vector search - the mathematical concepts, representation methods, and core techniques that power modern semantic search systems.
Table of Contents
- Introduction to Vector Search
- Understanding Vectors and Embeddings
- Types of Vector Representations
- Vector Search Techniques
- Vector Databases
Introduction to Vector Search
What is Vector Search?
Vector search is a modern approach to finding similar content by representing data as high-dimensional numerical vectors. Instead of searching for exact keyword matches like traditional search engines, vector search finds items that are semantically similar - meaning they have similar meanings or contexts.
Think of it this way: Imagine you're looking for movies similar to "The Matrix." Traditional keyword search might only find movies with "Matrix" in the title. Vector search, however, would find sci-fi movies with similar themes like "Inception" or "Blade Runner" because they share semantic similarity in the vector space.
Why Vector Search Matters
- Semantic Understanding: Captures the meaning behind words, not just exact matches
- Multi-modal Support: Works with text, images, audio, and other data types
- Context Awareness: Understands relationships and context between different pieces of information
- Flexible Querying: Enables natural language queries and similarity-based searches
Real-World Applications
- Search Engines: Finding relevant documents based on meaning, not just keywords
- Recommendation Systems: Suggesting products, movies, or content based on user preferences
- Question Answering: Retrieving relevant context for LLM-based chat systems
- Image Search: Finding visually similar images
- Duplicate Detection: Identifying similar or duplicate content
Understanding Vectors and Embeddings
What are Vectors?
In the context of machine learning and search, a vector is a list of numbers that represents data in a mathematical form that computers can understand and process. Think of a vector as coordinates in a multi-dimensional space.
Simple Example:
- A 2D vector [3, 4] represents a point in 2D space
- A 3D vector [3, 4, 5] represents a point in 3D space
- An embedding vector [0.2, -0.1, 0.8, ...] might have 768 dimensions representing a word or document
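A minimal NumPy sketch of these ideas - the values below are just illustrative coordinates, not real embeddings:
import numpy as np
# A point in 2D space and a point in 3D space
point_2d = np.array([3, 4])
point_3d = np.array([3, 4, 5])
# A toy "embedding" with only 3 dimensions; real embeddings have hundreds
toy_embedding = np.array([0.2, -0.1, 0.8])
print(point_2d.shape)            # (2,)
print(toy_embedding.shape)       # (3,)
print(np.linalg.norm(point_2d))  # length (magnitude) of the 2D vector: 5.0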
What are Embeddings?
Embeddings are a special type of vector that represents the semantic meaning of data (like words, sentences, or images) in a continuous numerical space. They are created by machine learning models trained on large datasets.
Key Properties of Good Embeddings:
- Semantic Similarity: Similar items have similar vectors
- Distance Relationships: The distance between vectors reflects semantic relationships
- Dense Representation: Each dimension contributes to the meaning (unlike sparse representations)
How Embeddings Capture Meaning
Consider these movie examples:
- "Interstellar" โ
[0.8, 0.1, 0.1]
(high sci-fi, low drama, low comedy) - "The Notebook" โ
[0.1, 0.9, 0.1]
(low sci-fi, high drama, low comedy) - "Shrek" โ
[0.1, 0.1, 0.8]
(low sci-fi, low drama, high comedy)
Movies with similar genres will have vectors that are close to each other in this space.
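You can check this with a few lines of code. The query vector below (standing in for another sci-fi film) is made up for illustration:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Toy genre vectors from above: [sci-fi, drama, comedy]
movies = {
    "Interstellar": [0.8, 0.1, 0.1],
    "The Notebook": [0.1, 0.9, 0.1],
    "Shrek": [0.1, 0.1, 0.8],
}
# Hypothetical vector for another sci-fi film (made-up numbers)
query = np.array([[0.7, 0.2, 0.1]])
for title, vec in movies.items():
    score = cosine_similarity(query, [vec])[0][0]
    print(f"{title}: {score:.3f}")
# "Interstellar" scores highest because its vector points in
# nearly the same direction as the query vector.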
Types of Vector Representations
1. One-Hot Encoding
What it is: The simplest way to represent categorical data as vectors. Each item gets a vector with a single 1 and the rest 0s.
Example:
# Vocabulary: ["apple", "banana", "cherry"]
"apple" → [1, 0, 0]
"banana" → [0, 1, 0]
"cherry" → [0, 0, 1]
Code Example:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
words = ["apple", "banana", "cherry"]
data = np.array(words).reshape(-1, 1)
encoder = OneHotEncoder()
one_hot_encoded = encoder.fit_transform(data)
print("One-Hot Encoded Vectors:")
print(one_hot_encoded.toarray())
Limitations:
- No semantic relationships (apple and banana don't appear similar)
- Very high dimensionality for large vocabularies
- Sparse (mostly zeros)
- Memory inefficient
2. Dense Vectors (Embeddings)
What they are: Compact, dense numerical representations where each dimension captures some aspect of meaning.
Example:
"apple" → [0.2, -0.1, 0.8, 0.3, ...] # 300+ dimensions
"banana" → [0.1, -0.2, 0.7, 0.4, ...] # Similar to apple (both fruits)
"car" → [0.9, 0.5, -0.1, 0.2, ...] # Very different from fruits
Advantages:
- Capture semantic relationships
- Much more compact
- Enable similarity calculations
- Work well with machine learning models
Creating Dense Vectors:
from sentence_transformers import SentenceTransformer
# Load pre-trained model
model = SentenceTransformer("all-mpnet-base-v2")
# Generate embeddings
texts = ["I love machine learning", "AI is fascinating", "The weather is nice"]
embeddings = model.encode(texts)
print(f"Embedding shape: {embeddings.shape}") # e.g., (3, 768)
print(f"First embedding: {embeddings[0][:5]}...") # First 5 dimensions
3. Choosing the Right Dimensionality
How many dimensions do you need?
- Word embeddings: 100-300 dimensions (Word2Vec, GloVe)
- Sentence embeddings: 384-768 dimensions (BERT, MPNet)
- Document embeddings: 512-1024+ dimensions
- Image embeddings: 512-2048+ dimensions
Trade-offs:
- More dimensions: Better representation, more computational cost
- Fewer dimensions: Faster processing, potential information loss
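One quick way to see this trade-off is to compare two common sentence-transformers models; the model names below are examples, not requirements:
from sentence_transformers import SentenceTransformer
# Smaller model: faster to run, 384-dimensional embeddings
small_model = SentenceTransformer("all-MiniLM-L6-v2")
# Larger model: slower, 768-dimensional embeddings
large_model = SentenceTransformer("all-mpnet-base-v2")
print(small_model.get_sentence_embedding_dimension())  # 384
print(large_model.get_sentence_embedding_dimension())  # 768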
Vector Search Techniques
1. Similarity Metrics
Vector search relies on measuring how "similar" vectors are. Here are the most common metrics:
Cosine Similarity
What it measures: The angle between two vectors (ignores magnitude)
Range: -1 to 1 (1 = identical, 0 = orthogonal, -1 = opposite)
Best for: Text embeddings, normalized data
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Example vectors
vec1 = np.array([[0.2, 0.8, 0.1]])
vec2 = np.array([[0.1, 0.9, 0.0]])
similarity = cosine_similarity(vec1, vec2)
print(f"Cosine similarity: {similarity[0][0]:.3f}")
Euclidean Distance
What it measures: Straight-line distance between points
Range: 0 to infinity (0 = identical, larger = more different)
Best for: Image embeddings, when magnitude matters
from sklearn.metrics.pairwise import euclidean_distances
distance = euclidean_distances(vec1, vec2)
print(f"Euclidean distance: {distance[0][0]:.3f}")
2. Basic Vector Search
Simple Implementation:
def simple_vector_search(query_vector, document_vectors, top_k=5):
    """
    Find the most similar documents to a query.
    """
    similarities = cosine_similarity([query_vector], document_vectors)[0]
    # Get indices of the top-k most similar documents
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return top_indices, similarities[top_indices]

# Example usage
query = "machine learning tutorial"
query_vector = model.encode(query)
# Assume we have document vectors
top_docs, scores = simple_vector_search(query_vector, document_embeddings)
3. Hybrid Search
The Problem: Pure vector search sometimes misses exact matches or specific terms.
The Solution: Combine vector search (semantic) with keyword search (lexical).
Example Scenario:
- Query: "18 U.S.C. § 1341" (a specific legal code)
- Vector search might find semantically similar laws
- Keyword search finds the exact code
- Hybrid search combines both for better results
Implementation:
from sklearn.feature_extraction.text import TfidfVectorizer

def hybrid_search(query, documents, embeddings, alpha=0.5):
    """
    Combine vector and keyword search.
    alpha: weight for vector search (1 - alpha for keyword search)
    """
    # Vector search scores
    query_vector = model.encode(query)
    vector_scores = cosine_similarity([query_vector], embeddings)[0]

    # Keyword search scores (TF-IDF)
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)
    query_tfidf = vectorizer.transform([query])
    keyword_scores = cosine_similarity(query_tfidf, tfidf_matrix)[0]

    # Normalize scores to 0-1 range
    vector_scores = (vector_scores - vector_scores.min()) / (vector_scores.max() - vector_scores.min())
    keyword_scores = (keyword_scores - keyword_scores.min()) / (keyword_scores.max() - keyword_scores.min())

    # Combine scores
    combined_scores = alpha * vector_scores + (1 - alpha) * keyword_scores
    return combined_scores
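A quick usage sketch of the function above - the documents and query are made up, and `model` is the sentence-transformers model loaded earlier:
# Hypothetical documents and query for illustration
documents = [
    "18 U.S.C. § 1341 covers mail fraud.",
    "Wire fraud is addressed in a related statute.",
    "This page explains how to bake sourdough bread.",
]
embeddings = model.encode(documents)
scores = hybrid_search("18 U.S.C. § 1341", documents, embeddings, alpha=0.5)
print(documents[int(np.argmax(scores))])  # the exact statute should rank first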
4. Approximate Nearest Neighbors (ANN)
For large datasets, exact search becomes too slow. ANN algorithms provide fast approximate results:
Popular ANN Libraries:
- FAISS: Facebook's similarity search library
- Annoy: Spotify's approximate nearest neighbors
- HNSW: Hierarchical Navigable Small World graphs
FAISS Example:
import faiss
import numpy as np
# Create FAISS index
dimension = 768 # embedding dimension
index = faiss.IndexFlatL2(dimension) # L2 distance index
# Add vectors to index
embeddings = np.random.random((1000, dimension)).astype('float32')
index.add(embeddings)
# Search
query_vector = np.random.random((1, dimension)).astype('float32')
distances, indices = index.search(query_vector, k=5) # top 5 results
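HNSW indexes are available through several libraries; a minimal sketch using the hnswlib package (assuming it is installed) might look like this:
import hnswlib
import numpy as np

dimension = 768
num_vectors = 1000
embeddings = np.random.random((num_vectors, dimension)).astype('float32')

# Build an HNSW index using cosine distance
index = hnswlib.Index(space='cosine', dim=dimension)
index.init_index(max_elements=num_vectors, ef_construction=200, M=16)
index.add_items(embeddings, np.arange(num_vectors))
index.set_ef(50)  # query-time trade-off between speed and recall

# Search
query_vector = np.random.random((1, dimension)).astype('float32')
labels, distances = index.knn_query(query_vector, k=5)  # top 5 results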
Vector Databases
What are Vector Databases?
Vector databases are specialized systems designed to store, index, and query high-dimensional vector data efficiently. They are optimized for similarity search operations that traditional databases struggle with.
Key Components
- Vector Storage: Efficiently stores millions/billions of high-dimensional vectors
- Indexing Engine: Creates indices for fast retrieval (FAISS, HNSW, etc.)
- Query Engine: Processes similarity queries using distance metrics
- Metadata Storage: Stores associated data like IDs, timestamps, categories
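To make these components concrete, here is a minimal sketch with Chroma (one of the databases listed below); the collection name and documents are made up for illustration:
import chromadb

client = chromadb.Client()  # in-memory instance
collection = client.create_collection(name="articles")  # hypothetical collection name

# Vector + metadata storage: Chroma embeds the documents with its default model
collection.add(
    documents=[
        "Vector search finds semantically similar items.",
        "Traditional search matches exact keywords.",
    ],
    metadatas=[{"topic": "vector-search"}, {"topic": "keyword-search"}],
    ids=["doc1", "doc2"],
)

# Query engine: similarity search against the stored vectors
results = collection.query(query_texts=["semantic similarity"], n_results=1)
print(results["documents"])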
Popular Vector Databases
Open Source Options:
- Milvus: Scalable vector database for AI applications
- Weaviate: Vector search engine with GraphQL API
- FAISS: Facebook's similarity search library
- Elasticsearch: Traditional search with vector capabilities
- Chroma: Simple vector database for LLM applications
Managed/Commercial Options:
- Pinecone: Fully managed vector database
- Qdrant: Vector search engine with API
- Weaviate Cloud: Managed Weaviate
- AWS OpenSearch: Amazon's vector search service
Advantages Over Traditional Databases
Feature | Traditional DB | Vector DB |
---|---|---|
Data Type | Structured (rows/columns) | High-dimensional vectors |
Query Type | Exact matches, ranges | Similarity search |
Scalability | Good for structured data | Optimized for vector operations |
Search Speed | Fast for indexed fields | Fast for similarity queries |
Use Cases | CRUD operations | Recommendation, search, AI |
Chapter 1 Summary
What You've Learned
In this foundational chapter, you've discovered:
- Vector Search Fundamentals: Understanding semantic vs. keyword search
- Vector Mathematics: How numbers represent meaning in multi-dimensional space
- Representation Types: From simple one-hot to sophisticated dense embeddings
- Search Techniques: Similarity metrics, hybrid approaches, and optimization methods
- Storage Solutions: Specialized databases designed for vector operations
Key Takeaways
- Vectors enable computers to understand meaning - not just match text
- Embeddings capture semantic relationships - similar concepts cluster together
- Multiple similarity metrics exist - choose based on your data type and use case
- Hybrid search combines strengths - semantic understanding + exact matching
- Specialized databases matter - vector databases outperform traditional ones for similarity search