This content originally appeared on DEV Community and was authored by Abdelrahman Adnan
Module Overview: Welcome to Module 2 of the LLM Zoomcamp! This chapter covers the theoretical foundations of vector search - the mathematical concepts, representation methods, and core techniques that power modern semantic search systems.
Table of Contents
- Introduction to Vector Search
- Understanding Vectors and Embeddings
- Types of Vector Representations
- Vector Search Techniques
- Vector Databases
Introduction to Vector Search
What is Vector Search?
Vector search is a modern approach to finding similar content by representing data as high-dimensional numerical vectors. Instead of searching for exact keyword matches like traditional search engines, vector search finds items that are semantically similar - meaning they have similar meanings or contexts.
Think of it this way: Imagine you're looking for movies similar to "The Matrix." Traditional keyword search might only find movies with "Matrix" in the title. Vector search, however, would find sci-fi movies with similar themes like "Inception" or "Blade Runner" because they share semantic similarity in the vector space.
Why Vector Search Matters
- Semantic Understanding: Captures the meaning behind words, not just exact matches
- Multi-modal Support: Works with text, images, audio, and other data types
- Context Awareness: Understands relationships and context between different pieces of information
- Flexible Querying: Enables natural language queries and similarity-based searches
Real-World Applications
- Search Engines: Finding relevant documents based on meaning, not just keywords
- Recommendation Systems: Suggesting products, movies, or content based on user preferences
- Question Answering: Retrieving relevant context for LLM-based chat systems
- Image Search: Finding visually similar images
- Duplicate Detection: Identifying similar or duplicate content
Understanding Vectors and Embeddings
What are Vectors?
In the context of machine learning and search, a vector is a list of numbers that represents data in a mathematical form that computers can understand and process. Think of a vector as coordinates in a multi-dimensional space.
Simple Example:
- A 2D vector [3, 4] represents a point in 2D space
- A 3D vector [3, 4, 5] represents a point in 3D space
- An embedding vector [0.2, -0.1, 0.8, ...] might have 768 dimensions representing a word or document
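A minimal NumPy sketch of these ideas - the values below are just illustrative coordinates, not real embeddings:
import numpy as np
# A point in 2D space and a point in 3D space
point_2d = np.array([3, 4])
point_3d = np.array([3, 4, 5])
# A toy "embedding" with only 3 dimensions; real embeddings have hundreds
toy_embedding = np.array([0.2, -0.1, 0.8])
print(point_2d.shape)            # (2,)
print(toy_embedding.shape)       # (3,)
print(np.linalg.norm(point_2d))  # length (magnitude) of the 2D vector: 5.0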
What are Embeddings?
Embeddings are a special type of vector that represents the semantic meaning of data (like words, sentences, or images) in a continuous numerical space. They are created by machine learning models trained on large datasets.
Key Properties of Good Embeddings:
- Semantic Similarity: Similar items have similar vectors
- Distance Relationships: The distance between vectors reflects semantic relationships
- Dense Representation: Each dimension contributes to the meaning (unlike sparse representations)
How Embeddings Capture Meaning
Consider these movie examples:
- "Interstellar" โ
[0.8, 0.1, 0.1]
(high sci-fi, low drama, low comedy) - "The Notebook" โ
[0.1, 0.9, 0.1]
(low sci-fi, high drama, low comedy) - "Shrek" โ
[0.1, 0.1, 0.8]
(low sci-fi, low drama, high comedy)
Movies with similar genres will have vectors that are close to each other in this space.
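You can check this with a few lines of code. The query vector below (standing in for another sci-fi film) is made up for illustration:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Toy genre vectors from above: [sci-fi, drama, comedy]
movies = {
    "Interstellar": [0.8, 0.1, 0.1],
    "The Notebook": [0.1, 0.9, 0.1],
    "Shrek": [0.1, 0.1, 0.8],
}
# Hypothetical vector for another sci-fi film (made-up numbers)
query = np.array([[0.7, 0.2, 0.1]])
for title, vec in movies.items():
    score = cosine_similarity(query, [vec])[0][0]
    print(f"{title}: {score:.3f}")
# "Interstellar" scores highest because its vector points in
# nearly the same direction as the query vector.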
Types of Vector Representations
1. One-Hot Encoding
What it is: The simplest way to represent categorical data as vectors. Each item gets a vector with a single 1 and the rest 0s.
Example:
# Vocabulary: ["apple", "banana", "cherry"]
"apple" → [1, 0, 0]
"banana" → [0, 1, 0]
"cherry" → [0, 0, 1]
Code Example:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
words = ["apple", "banana", "cherry"]
data = np.array(words).reshape(-1, 1)
encoder = OneHotEncoder()
one_hot_encoded = encoder.fit_transform(data)
print("One-Hot Encoded Vectors:")
print(one_hot_encoded.toarray())
Limitations:
- No semantic relationships (apple and banana don't appear similar)
- Very high dimensionality for large vocabularies
- Sparse (mostly zeros)
- Memory inefficient
2. Dense Vectors (Embeddings)
What they are: Compact, dense numerical representations where each dimension captures some aspect of meaning.
Example:
"apple" → [0.2, -0.1, 0.8, 0.3, ...] # 300+ dimensions
"banana" → [0.1, -0.2, 0.7, 0.4, ...] # Similar to apple (both fruits)
"car" → [0.9, 0.5, -0.1, 0.2, ...] # Very different from fruits
Advantages:
- Capture semantic relationships
- Much more compact
- Enable similarity calculations
- Work well with machine learning models
Creating Dense Vectors:
from sentence_transformers import SentenceTransformer
# Load pre-trained model
model = SentenceTransformer("all-mpnet-base-v2")
# Generate embeddings
texts = ["I love machine learning", "AI is fascinating", "The weather is nice"]
embeddings = model.encode(texts)
print(f"Embedding shape: {embeddings.shape}") # e.g., (3, 768)
print(f"First embedding: {embeddings[0][:5]}...") # First 5 dimensions
3. Choosing the Right Dimensionality
How many dimensions do you need?
- Word embeddings: 100-300 dimensions (Word2Vec, GloVe)
- Sentence embeddings: 384-768 dimensions (BERT, MPNet)
- Document embeddings: 512-1024+ dimensions
- Image embeddings: 512-2048+ dimensions
Trade-offs:
- More dimensions: Better representation, more computational cost
- Fewer dimensions: Faster processing, potential information loss
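One quick way to see this trade-off is to compare two common sentence-transformers models; the model names below are examples, not requirements:
from sentence_transformers import SentenceTransformer
# Smaller model: faster to run, 384-dimensional embeddings
small_model = SentenceTransformer("all-MiniLM-L6-v2")
# Larger model: slower, 768-dimensional embeddings
large_model = SentenceTransformer("all-mpnet-base-v2")
print(small_model.get_sentence_embedding_dimension())  # 384
print(large_model.get_sentence_embedding_dimension())  # 768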
Vector Search Techniques
1. Similarity Metrics
Vector search relies on measuring how "similar" vectors are. Here are the most common metrics:
Cosine Similarity
What it measures: The angle between two vectors (ignores magnitude)
Range: -1 to 1 (1 = identical, 0 = orthogonal, -1 = opposite)
Best for: Text embeddings, normalized data
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# Example vectors
vec1 = np.array([[0.2, 0.8, 0.1]])
vec2 = np.array([[0.1, 0.9, 0.0]])
similarity = cosine_similarity(vec1, vec2)
print(f"Cosine similarity: {similarity[0][0]:.3f}")
Euclidean Distance
What it measures: Straight-line distance between points
Range: 0 to infinity (0 = identical, larger = more different)
Best for: Image embeddings, when magnitude matters
from sklearn.metrics.pairwise import euclidean_distances
distance = euclidean_distances(vec1, vec2)
print(f"Euclidean distance: {distance[0][0]:.3f}")
2. Basic Vector Search
Simple Implementation:
def simple_vector_search(query_vector, document_vectors, top_k=5):
    """
    Find the most similar documents to a query.
    """
    similarities = cosine_similarity([query_vector], document_vectors)[0]
    # Get indices of the top-k most similar documents
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return top_indices, similarities[top_indices]

# Example usage
query = "machine learning tutorial"
query_vector = model.encode(query)
# Assume we have document vectors
top_docs, scores = simple_vector_search(query_vector, document_embeddings)
3. Hybrid Search
The Problem: Pure vector search sometimes misses exact matches or specific terms.
The Solution: Combine vector search (semantic) with keyword search (lexical).
Example Scenario:
- Query: "18 U.S.C. § 1341" (a specific legal code)
- Vector search might find semantically similar laws
- Keyword search finds the exact code
- Hybrid search combines both for better results
Implementation:
from sklearn.feature_extraction.text import TfidfVectorizer

def hybrid_search(query, documents, embeddings, alpha=0.5):
    """
    Combine vector and keyword search.
    alpha: weight for vector search (1 - alpha for keyword search)
    """
    # Vector search scores
    query_vector = model.encode(query)
    vector_scores = cosine_similarity([query_vector], embeddings)[0]

    # Keyword search scores (TF-IDF)
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)
    query_tfidf = vectorizer.transform([query])
    keyword_scores = cosine_similarity(query_tfidf, tfidf_matrix)[0]

    # Normalize scores to 0-1 range
    vector_scores = (vector_scores - vector_scores.min()) / (vector_scores.max() - vector_scores.min())
    keyword_scores = (keyword_scores - keyword_scores.min()) / (keyword_scores.max() - keyword_scores.min())

    # Combine scores
    combined_scores = alpha * vector_scores + (1 - alpha) * keyword_scores
    return combined_scores
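A quick usage sketch of the function above - the documents and query are made up, and `model` is the sentence-transformers model loaded earlier:
# Hypothetical documents and query for illustration
documents = [
    "18 U.S.C. § 1341 covers mail fraud.",
    "Wire fraud is addressed in a related statute.",
    "This page explains how to bake sourdough bread.",
]
embeddings = model.encode(documents)
scores = hybrid_search("18 U.S.C. § 1341", documents, embeddings, alpha=0.5)
print(documents[int(np.argmax(scores))])  # the exact statute should rank first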
4. Approximate Nearest Neighbors (ANN)
For large datasets, exact search becomes too slow. ANN algorithms provide fast approximate results:
Popular ANN Libraries:
- FAISS: Facebook's similarity search library
- Annoy: Spotify's approximate nearest neighbors
- HNSW: Hierarchical Navigable Small World graphs
FAISS Example:
import faiss
import numpy as np
# Create FAISS index
dimension = 768 # embedding dimension
index = faiss.IndexFlatL2(dimension) # L2 distance index
# Add vectors to index
embeddings = np.random.random((1000, dimension)).astype('float32')
index.add(embeddings)
# Search
query_vector = np.random.random((1, dimension)).astype('float32')
distances, indices = index.search(query_vector, k=5) # top 5 results
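HNSW indexes are available through several libraries; a minimal sketch using the hnswlib package (assuming it is installed) might look like this:
import hnswlib
import numpy as np

dimension = 768
num_vectors = 1000
embeddings = np.random.random((num_vectors, dimension)).astype('float32')

# Build an HNSW index using cosine distance
index = hnswlib.Index(space='cosine', dim=dimension)
index.init_index(max_elements=num_vectors, ef_construction=200, M=16)
index.add_items(embeddings, np.arange(num_vectors))
index.set_ef(50)  # query-time trade-off between speed and recall

# Search
query_vector = np.random.random((1, dimension)).astype('float32')
labels, distances = index.knn_query(query_vector, k=5)  # top 5 results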
Vector Databases
What are Vector Databases?
Vector databases are specialized systems designed to store, index, and query high-dimensional vector data efficiently. They are optimized for similarity search operations that traditional databases struggle with.
Key Components
- Vector Storage: Efficiently stores millions/billions of high-dimensional vectors
- Indexing Engine: Creates indices for fast retrieval (FAISS, HNSW, etc.)
- Query Engine: Processes similarity queries using distance metrics
- Metadata Storage: Stores associated data like IDs, timestamps, categories
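To make these components concrete, here is a minimal sketch with Chroma (one of the databases listed below); the collection name and documents are made up for illustration:
import chromadb

client = chromadb.Client()  # in-memory instance
collection = client.create_collection(name="articles")  # hypothetical collection name

# Vector + metadata storage: Chroma embeds the documents with its default model
collection.add(
    documents=[
        "Vector search finds semantically similar items.",
        "Traditional search matches exact keywords.",
    ],
    metadatas=[{"topic": "vector-search"}, {"topic": "keyword-search"}],
    ids=["doc1", "doc2"],
)

# Query engine: similarity search against the stored vectors
results = collection.query(query_texts=["semantic similarity"], n_results=1)
print(results["documents"])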
Popular Vector Databases
Open Source Options:
- Milvus: Scalable vector database for AI applications
- Weaviate: Vector search engine with GraphQL API
- FAISS: Facebook's similarity search library
- Elasticsearch: Traditional search with vector capabilities
- Chroma: Simple vector database for LLM applications
Managed/Commercial Options:
- Pinecone: Fully managed vector database
- Qdrant: Vector search engine with API
- Weaviate Cloud: Managed Weaviate
- AWS OpenSearch: Amazon's vector search service
Advantages Over Traditional Databases
Feature | Traditional DB | Vector DB |
---|---|---|
Data Type | Structured (rows/columns) | High-dimensional vectors |
Query Type | Exact matches, ranges | Similarity search |
Scalability | Good for structured data | Optimized for vector operations |
Search Speed | Fast for indexed fields | Fast for similarity queries |
Use Cases | CRUD operations | Recommendation, search, AI |
Chapter 1 Summary
What You've Learned
In this foundational chapter, you've discovered:
- Vector Search Fundamentals: Understanding semantic vs. keyword search
- Vector Mathematics: How numbers represent meaning in multi-dimensional space
- Representation Types: From simple one-hot to sophisticated dense embeddings
- Search Techniques: Similarity metrics, hybrid approaches, and optimization methods
- Storage Solutions: Specialized databases designed for vector operations
Key Takeaways
- Vectors enable computers to understand meaning - not just match text
- Embeddings capture semantic relationships - similar concepts cluster together
- Multiple similarity metrics exist - choose based on your data type and use case
- Hybrid search combines strengths - semantic understanding + exact matching
- Specialized databases matter - vector databases outperform traditional ones for similarity search