🎓 LLM Zoomcamp Module 2 – Chapter 1: Vector Search Foundations & Theory

📚 Module Overview: Welcome to Module 2 of the LLM Zoomcamp! This chapter covers the theoretical foundations of vector search – the mathematical concepts, representation methods, and core techniques that power modern semantic search systems.



📖 Table of Contents

  1. 🔍 Introduction to Vector Search
  2. 🧮 Understanding Vectors and Embeddings
  3. 📊 Types of Vector Representations
  4. ⚡ Vector Search Techniques
  5. 🗄️ Vector Databases

๐Ÿ” Introduction to Vector Search

๐ŸŽฏ What is Vector Search?

Vector search is a modern approach to finding similar content by representing data as high-dimensional numerical vectors. Instead of matching exact keywords the way traditional search engines do, vector search finds items that are semantically similar – items that share meaning or context even when they use different words.

🎬 Think of it this way: Imagine you're looking for movies similar to "The Matrix." Traditional keyword search might only find movies with "Matrix" in the title. Vector search, however, would find sci-fi movies with similar themes like "Inception" or "Blade Runner" because they share semantic similarity in the vector space.

🌟 Why Vector Search Matters

  1. 🧠 Semantic Understanding: Captures the meaning behind words, not just exact matches
  2. 🔄 Multi-modal Support: Works with text, images, audio, and other data types
  3. 🎯 Context Awareness: Understands relationships and context between different pieces of information
  4. 💬 Flexible Querying: Enables natural language queries and similarity-based searches

🚀 Real-World Applications

  • 🔍 Search Engines: Finding relevant documents based on meaning, not just keywords
  • 📝 Recommendation Systems: Suggesting products, movies, or content based on user preferences
  • ❓ Question Answering: Retrieving relevant context for LLM-based chat systems
  • 🖼️ Image Search: Finding visually similar images
  • 🔄 Duplicate Detection: Identifying similar or duplicate content

🧮 Understanding Vectors and Embeddings

📏 What are Vectors?

In the context of machine learning and search, a vector is a list of numbers that represents data in a mathematical form that computers can understand and process. Think of a vector as coordinates in a multi-dimensional space.

📊 Simple Example:

  • A 2D vector: [3, 4] represents a point in 2D space
  • A 3D vector: [3, 4, 5] represents a point in 3D space
  • An embedding vector: [0.2, -0.1, 0.8, ...] might have 768 dimensions representing a word or document
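A minimal sketch in NumPy makes this concrete (the specific numbers are arbitrary):

import numpy as np

# A 2D point and a 3D point represented as vectors
point_2d = np.array([3, 4])
point_3d = np.array([3, 4, 5])

# A vector's magnitude is its distance from the origin
print(np.linalg.norm(point_2d))  # 5.0, since sqrt(3^2 + 4^2) = 5

# A (made-up) 768-dimensional embedding is just a much longer vector
embedding = np.random.random(768)
print(embedding.shape)  # (768,)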

💎 What are Embeddings?

Embeddings are a special type of vector that represents the semantic meaning of data (like words, sentences, or images) in a continuous numerical space. They are created by machine learning models trained on large datasets.

🎯 Key Properties of Good Embeddings:

  1. 🤝 Semantic Similarity: Similar items have similar vectors
  2. 📏 Distance Relationships: The distance between vectors reflects semantic relationships
  3. 📦 Dense Representation: Each dimension contributes to the meaning (unlike sparse representations)

🎭 How Embeddings Capture Meaning

Consider these movie examples:

  • "Interstellar" โ†’ [0.8, 0.1, 0.1] (high sci-fi, low drama, low comedy)
  • "The Notebook" โ†’ [0.1, 0.9, 0.1] (low sci-fi, high drama, low comedy)
  • "Shrek" โ†’ [0.1, 0.1, 0.8] (low sci-fi, low drama, high comedy)

Movies with similar genres will have vectors that are close to each other in this space.
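A short sketch using these toy genre vectors shows the idea numerically (the three dimensions here are invented for illustration):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

movies = {
    "Interstellar": [0.8, 0.1, 0.1],
    "The Notebook": [0.1, 0.9, 0.1],
    "Shrek":        [0.1, 0.1, 0.8],
}
vectors = np.array(list(movies.values()))

# Pairwise cosine similarities: the diagonal is 1.0 (each movie vs. itself),
# and the off-diagonal values are low because the genres barely overlap
print(cosine_similarity(vectors).round(2))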

📊 Types of Vector Representations

1️⃣ One-Hot Encoding

🔢 What it is: The simplest way to represent categorical data as vectors. Each item gets a vector with a single 1 and the rest 0s.

📝 Example:

# Vocabulary: ["apple", "banana", "cherry"]
"apple"  โ†’ [1, 0, 0]
"banana" โ†’ [0, 1, 0] 
"cherry" โ†’ [0, 0, 1]

💻 Code Example:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

words = ["apple", "banana", "cherry"]
data = np.array(words).reshape(-1, 1)
encoder = OneHotEncoder()
one_hot_encoded = encoder.fit_transform(data)
print("One-Hot Encoded Vectors:")
print(one_hot_encoded.toarray())

โš ๏ธ Limitations:

  • No semantic relationships (apple and banana don't appear similar)
  • Very high dimensionality for large vocabularies
  • Sparse (mostly zeros)
  • Memory inefficient
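The first limitation is easy to verify: any two distinct one-hot vectors are orthogonal, so their cosine similarity is exactly 0 and "apple" looks no more like "banana" than like "cherry":

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

apple  = np.array([[1, 0, 0]])
banana = np.array([[0, 1, 0]])

# Orthogonal vectors: similarity is always 0, regardless of meaning
print(cosine_similarity(apple, banana)[0][0])  # 0.0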

2๏ธโƒฃ Dense Vectors (Embeddings)

๐ŸŒŸ What they are: Compact, dense numerical representations where each dimension captures some aspect of meaning.

๐Ÿ“Š Example:

"apple"  โ†’ [0.2, -0.1, 0.8, 0.3, ...]  # 300+ dimensions
"banana" โ†’ [0.1, -0.2, 0.7, 0.4, ...]  # Similar to apple (both fruits)
"car"    โ†’ [0.9, 0.5, -0.1, 0.2, ...]  # Very different from fruits

✅ Advantages:

  • Capture semantic relationships
  • Much more compact
  • Enable similarity calculations
  • Work well with machine learning models

๐Ÿ› ๏ธ Creating Dense Vectors:

from sentence_transformers import SentenceTransformer

# Load pre-trained model
model = SentenceTransformer("all-mpnet-base-v2")

# Generate embeddings
texts = ["I love machine learning", "AI is fascinating", "The weather is nice"]
embeddings = model.encode(texts)

print(f"Embedding shape: {embeddings.shape}")  # e.g., (3, 768)
print(f"First embedding: {embeddings[0][:5]}...")  # First 5 dimensions

3๏ธโƒฃ Choosing the Right Dimensionality

๐Ÿค” How many dimensions do you need?

  • ๐Ÿ“ Word embeddings: 100-300 dimensions (Word2Vec, GloVe)
  • ๐Ÿ“„ Sentence embeddings: 384-768 dimensions (BERT, MPNet)
  • ๐Ÿ“š Document embeddings: 512-1024+ dimensions
  • ๐Ÿ–ผ๏ธ Image embeddings: 512-2048+ dimensions

โš–๏ธ Trade-offs:

  • โž• More dimensions: Better representation, more computational cost
  • โž– Fewer dimensions: Faster processing, potential information loss
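One way to feel this trade-off is to compare two common sentence-transformers models: all-MiniLM-L6-v2 (384 dimensions) and all-mpnet-base-v2 (768 dimensions). A rough sketch (actual timings depend on your hardware):

import time
from sentence_transformers import SentenceTransformer

texts = ["vector search tutorial"] * 100

for name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:
    model = SentenceTransformer(name)
    start = time.time()
    embeddings = model.encode(texts)
    # The smaller model produces 384-dim vectors and typically encodes faster
    print(f"{name}: {embeddings.shape[1]} dims, {time.time() - start:.2f}s")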

⚡ Vector Search Techniques

1️⃣ Similarity Metrics

Vector search relies on measuring how "similar" vectors are. Here are the most common metrics:

📐 Cosine Similarity

📊 What it measures: The angle between two vectors (ignores magnitude)
📈 Range: -1 to 1 (1 = identical, 0 = orthogonal, -1 = opposite)
🎯 Best for: Text embeddings, normalized data

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Example vectors
vec1 = np.array([[0.2, 0.8, 0.1]])
vec2 = np.array([[0.1, 0.9, 0.0]])

similarity = cosine_similarity(vec1, vec2)
print(f"Cosine similarity: {similarity[0][0]:.3f}")

๐Ÿ“ Euclidean Distance

๐Ÿ“Š What it measures: Straight-line distance between points
๐Ÿ“ˆ Range: 0 to infinity (0 = identical, larger = more different)
๐ŸŽฏ Best for: Image embeddings, when magnitude matters

from sklearn.metrics.pairwise import euclidean_distances

# Reusing vec1 and vec2 from the cosine similarity example above
distance = euclidean_distances(vec1, vec2)
print(f"Euclidean distance: {distance[0][0]:.3f}")

2๏ธโƒฃ Basic Vector Search

๐Ÿ”ง Simple Implementation:

def simple_vector_search(query_vector, document_vectors, top_k=5):
    """
    Find the most similar documents to a query
    """
    similarities = cosine_similarity([query_vector], document_vectors)[0]

    # Get indices of top-k most similar documents
    top_indices = np.argsort(similarities)[::-1][:top_k]

    return top_indices, similarities[top_indices]

# Example usage
query = "machine learning tutorial"
query_vector = model.encode(query)

# Assume document_embeddings was built earlier, e.g. model.encode(documents)
top_docs, scores = simple_vector_search(query_vector, document_embeddings)

3๏ธโƒฃ Hybrid Search

โš ๏ธ The Problem: Pure vector search sometimes misses exact matches or specific terms.

๐Ÿ’ก The Solution: Combine vector search (semantic) with keyword search (lexical).

๐Ÿ“– Example Scenario:

  • Query: "18 U.S.C. ยง 1341" (specific legal code)
  • Vector search might find semantically similar laws
  • Keyword search finds the exact code
  • Hybrid search combines both for better results

๐Ÿ› ๏ธ Implementation:

from sklearn.feature_extraction.text import TfidfVectorizer

def hybrid_search(query, documents, embeddings, alpha=0.5):
    """
    Combine vector and keyword search
    alpha: weight for vector search (1-alpha for keyword search)
    """
    # Vector search scores
    query_vector = model.encode(query)
    vector_scores = cosine_similarity([query_vector], embeddings)[0]

    # Keyword search scores (TF-IDF)
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(documents)
    query_tfidf = vectorizer.transform([query])
    keyword_scores = cosine_similarity(query_tfidf, tfidf_matrix)[0]

    # Normalize each score set to the 0-1 range (guard against a zero spread,
    # which would otherwise cause division by zero)
    def min_max(scores):
        spread = scores.max() - scores.min()
        return (scores - scores.min()) / spread if spread > 0 else np.zeros_like(scores)

    vector_scores = min_max(vector_scores)
    keyword_scores = min_max(keyword_scores)

    # Combine scores
    combined_scores = alpha * vector_scores + (1 - alpha) * keyword_scores

    return combined_scores
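A minimal usage sketch (the documents and query below are made up; model is the sentence transformer loaded earlier):

documents = [
    "Mail fraud is prohibited under 18 U.S.C. § 1341",
    "Wire fraud statutes cover electronic communications",
    "An introduction to machine learning",
]
embeddings = model.encode(documents)

scores = hybrid_search("18 U.S.C. § 1341", documents, embeddings, alpha=0.5)
best = np.argsort(scores)[::-1]
print(documents[best[0]])  # the exact statute should rank first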

4๏ธโƒฃ Approximate Nearest Neighbors (ANN)

For large datasets, exact search becomes too slow. ANN algorithms provide fast approximate results:

🚀 Popular ANN Libraries:

  • 📊 FAISS: Facebook's similarity search library
  • 🎵 Annoy: Spotify's approximate nearest neighbors
  • 🕸️ HNSW: Hierarchical Navigable Small World graphs

💻 FAISS Example:

import faiss
import numpy as np

# Build an HNSW index (a true ANN structure); faiss.IndexFlatL2 would give
# exact brute-force search and is a useful baseline on small datasets
dimension = 768  # embedding dimension
index = faiss.IndexHNSWFlat(dimension, 32)  # 32 = graph neighbors per node

# Add vectors to the index
embeddings = np.random.random((1000, dimension)).astype('float32')
index.add(embeddings)

# Retrieve the 5 approximate nearest neighbors of a query vector
query_vector = np.random.random((1, dimension)).astype('float32')
distances, indices = index.search(query_vector, 5)

๐Ÿ—„๏ธ Vector Databases

๐Ÿค– What are Vector Databases?

Vector databases are specialized systems designed to store, index, and query high-dimensional vector data efficiently. They are optimized for similarity search operations that traditional databases struggle with.

๐Ÿ”ง Key Components

  1. ๐Ÿ’พ Vector Storage: Efficiently stores millions/billions of high-dimensional vectors
  2. ๐Ÿ” Indexing Engine: Creates indices for fast retrieval (FAISS, HNSW, etc.)
  3. โšก Query Engine: Processes similarity queries using distance metrics
  4. ๐Ÿ“Š Metadata Storage: Stores associated data like IDs, timestamps, categories

๐Ÿ† Popular Vector Databases

๐Ÿ”“ Open Source Options:

  1. ๐Ÿš€ Milvus: Scalable vector database for AI applications
  2. ๐Ÿ•ธ๏ธ Weaviate: Vector search engine with GraphQL API
  3. ๐Ÿ“Š FAISS: Facebook's similarity search library
  4. ๐Ÿ” Elasticsearch: Traditional search with vector capabilities
  5. ๐ŸŽจ Chroma: Simple vector database for LLM applications

๐Ÿ’ผ Managed/Commercial Options:

  1. ๐ŸŒฒ Pinecone: Fully managed vector database
  2. โšก Qdrant: Vector search engine with API
  3. โ˜๏ธ Weaviate Cloud: Managed Weaviate
  4. ๐Ÿ” AWS OpenSearch: Amazon's vector search service

🔄 Advantages Over Traditional Databases

| Feature         | Traditional DB            | Vector DB                       |
|-----------------|---------------------------|---------------------------------|
| 📊 Data Type    | Structured (rows/columns) | High-dimensional vectors        |
| 🔍 Query Type   | Exact matches, ranges     | Similarity search               |
| 📈 Scalability  | Good for structured data  | Optimized for vector operations |
| ⚡ Search Speed | Fast for indexed fields   | Fast for similarity queries     |
| 🎯 Use Cases    | CRUD operations           | Recommendation, search, AI      |

🎓 Chapter 1 Summary

🌟 What You've Learned

In this foundational chapter, you've discovered:

  1. 🔍 Vector Search Fundamentals: Understanding semantic vs. keyword search
  2. 🧮 Vector Mathematics: How numbers represent meaning in multi-dimensional space
  3. 📊 Representation Types: From simple one-hot to sophisticated dense embeddings
  4. ⚡ Search Techniques: Similarity metrics, hybrid approaches, and optimization methods
  5. 🗄️ Storage Solutions: Specialized databases designed for vector operations

🔑 Key Takeaways

✅ Vectors enable computers to understand meaning - not just match text
✅ Embeddings capture semantic relationships - similar concepts cluster together
✅ Multiple similarity metrics exist - choose based on your data type and use case
✅ Hybrid search combines strengths - semantic understanding + exact matching
✅ Specialized databases matter - vector databases outperform traditional ones for similarity search


This content originally appeared on DEV Community and was authored by Abdelrahman Adnan

