1️⃣ Introduction
🔹 Vector
- A vector is simply an ordered list (array) of numbers.
- It can represent data points in 2D, 3D, or higher dimensions.
- Examples:
  - 2D → [3.5, 7.2]
  - 3D → [1.2, -4.5, 6.0]
- In machine learning, vectors are used to describe positions in a multi-dimensional space.
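🔹 Example: a vector in code
In code a vector is just an array of numbers. A minimal sketch using NumPy (the library choice and values are illustrative):

import numpy as np

# a 3D vector: an ordered list of three numbers
v = np.array([1.2, -4.5, 6.0])

print(v.shape)            # (3,) → three dimensions
print(np.linalg.norm(v))  # its magnitude (length) in vector space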
 
🔹 Embedding Vector
- An embedding vector (or just embedding) is a special kind of vector generated by an embedding model.
- Purpose: represent complex data (text, images, audio, etc.) in a way that captures meaning and similarity.
- All embeddings are vectors, but not all vectors are embeddings.
- Dimensions (e.g., 384, 768, 1536) are fixed by the model design:
  - Lightweight (384–512 dimensions) → e.g., all-MiniLM-L6-v2 (384d), Universal Sentence Encoder (512d)
  - Standard (768–1,536 dimensions) → e.g., all-mpnet-base-v2 (768d), OpenAI text-embedding-ada-002 (1,536d)
  - High capacity (2,000–3,000+ dimensions) → e.g., OpenAI text-embedding-3-large (3,072d)
- Higher dimensions = richer detail, but more costly to store and search.
- An embedding represents the semantic meaning of data, so similar concepts are placed close together in vector space:
  - “dog” and “puppy” → embedding vectors close together.
  - “dog” and “car” → embedding vectors far apart.
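🔹 Example: measuring semantic closeness
A minimal sketch using the OpenAI embeddings API and NumPy (assumes the OPENAI_API_KEY environment variable is set; any embedding model behaves similarly):

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text):
    # turn a string into an embedding vector (1536 dimensions for text-embedding-3-small)
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a, b):
    # cosine similarity: closer to 1.0 means more similar in meaning
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dog, puppy, car = embed("dog"), embed("puppy"), embed("car")
print(cosine(dog, puppy))  # expected: relatively high → close together
print(cosine(dog, car))    # expected: noticeably lower → far apart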
 
 
🔹 Positional Encoding
- Positional encoding is a technique used in Transformer models to provide word-order information to embeddings.
- Purpose: ensure that the sequence of tokens matters, so sentences with the same words in different orders have different meanings.
- Implemented by adding a positional vector to each token embedding before feeding it into the model:
  - Can be sinusoidal (fixed mathematical functions).
  - Or learned (trainable position embeddings).
- Formula (sinusoidal encoding, implemented in the sketch below):
  - PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
  - PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
- Sentence-level embeddings indirectly include positional information, since it’s baked into the Transformer during encoding.
- Positional encoding preserves contextual meaning by distinguishing different word orders:
  - “dog bites man” → one meaning.
  - “man bites dog” → very different meaning.
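🔹 Example: sinusoidal positional encoding
A minimal NumPy sketch of the formula above (the sequence length and d_model values are illustrative):

import numpy as np

def positional_encoding(seq_len, d_model):
    # pos indexes the token position, i indexes pairs of embedding dimensions
    pos = np.arange(seq_len)[:, np.newaxis]            # shape (seq_len, 1)
    i = np.arange(d_model // 2)[np.newaxis, :]         # shape (1, d_model / 2)
    angles = pos / np.power(10000, (2 * i) / d_model)  # pos / 10000^(2i/d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions → sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions → cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16): one positional vector per token, added to its token embedding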
 
 
🔹 Vector Databases
- A vector database is a system built to store and retrieve vectors, especially embeddings.
- Core features:
  - Persistence: store large volumes of embeddings.
  - Similarity search: find nearest neighbors using cosine similarity, dot product, or Euclidean distance.
  - Indexing: HNSW, IVF, Annoy, or PQ for fast retrieval.
  - Database functions: CRUD operations, filtering, replication, and scaling.
- Purpose: enable semantic search → finding results based on meaning instead of exact keyword matches.
 
2️⃣ Persistence
- Vectors often number in the millions or billions, so keeping them only in memory is not practical.
- Persistence ensures embeddings are stored long-term and survive restarts or failures.
- CRUD: create, read, update, and delete vectors.
- Durability: vectors are written to disk or distributed storage.
- Index persistence: not just the raw vectors but also the indexing structures (like HNSW graphs) are stored.
🔹 Example: creating an embedding (OpenAI)
from openai import OpenAI
client = OpenAI()  # reads the OPENAI_API_KEY environment variable
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The cat sat on the mat"
)
vector = response.data[0].embedding
print(len(vector))  # 1536 dimensions
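🔹 Example: index mapping (Elasticsearch)
Before documents are inserted, the index needs a mapping that declares the embedding field as a dense_vector, otherwise kNN search (shown later) will not work. A minimal sketch, assuming Elasticsearch 8.x; "dims": 3 matches the toy vectors in the next example, while a real text-embedding-3-small index would use 1536:

PUT /documents
{
  "mappings": {
    "properties": {
      "content":   { "type": "text" },
      "embedding": { "type": "dense_vector", "dims": 3, "index": true, "similarity": "cosine" }
    }
  }
}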
🔹 Example: insertion (Elasticsearch)
PUT /documents/_doc/1
{
  "content": "The cat sat on the mat",
  "embedding": [0.12, -0.87, 0.33]
}
PUT /documents/_doc/2
{
  "content": "The dog chased the ball",
  "embedding": [0.91, 0.04, -0.22]
}
3️⃣ Similarity Search
- Core function of a vector database → find vectors closest in meaning to a query vector.
 - Powers semantic search, recommendations, fraud detection, and Retrieval-Augmented Generation (RAG).
 
🔹 Common distance metrics
- Cosine similarity: measures the angle between two vectors, ignoring their magnitudes.
- Euclidean distance: the straight-line distance between two points in vector space.
- Dot product: measures alignment; for vectors normalized to unit length it is equivalent to cosine similarity.
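🔹 Example: computing the three metrics
A minimal NumPy sketch using two toy 3-dimensional vectors (real embeddings behave the same way, just with more dimensions):

import numpy as np

a = np.array([0.12, -0.87, 0.33])   # e.g., "The cat sat on the mat"
b = np.array([0.91,  0.04, -0.22])  # e.g., "The dog chased the ball"

cosine    = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based, 1.0 = same direction
euclidean = np.linalg.norm(a - b)                                   # straight-line distance, 0.0 = identical
dot       = np.dot(a, b)                                            # alignment; equals cosine for unit-length vectors

print(cosine, euclidean, dot)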
 
4️⃣ Indexing
- Searching vectors directly without an index is too slow for large datasets.
 - Indexing structures organize vectors so that nearest-neighbor queries can be answered efficiently.
 - These methods implement Approximate Nearest Neighbor (ANN) search → trading a bit of accuracy for major speed gains.
 
🔹 Popular ANN-based algorithms
All of the algorithms below are ANN methods: they find the closest vectors quickly by trading a small amount of accuracy for large gains in speed and scalability.
HNSW (Hierarchical Navigable Small World Graph)
- Builds a multi-layer graph where each vector connects to its neighbors.
 - ✅ Fast queries, high recall, supports dynamic updates.
 - ❌ High memory consumption, complex to tune.
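🔹 Example: building an HNSW index (hnswlib)
A minimal sketch using the hnswlib library (one common HNSW implementation; the random data and parameter values are illustrative):

import numpy as np
import hnswlib

dim, num_elements = 128, 10_000
data = np.random.rand(num_elements, dim).astype(np.float32)  # toy vectors

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)  # graph build parameters
index.add_items(data, np.arange(num_elements))  # insert vectors with integer ids

index.set_ef(50)  # higher ef → better recall, slower queries
labels, distances = index.knn_query(data[:1], k=5)  # 5 nearest neighbours of the first vector
print(labels, distances)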
 
IVF (Inverted File Index)
- Clusters vectors into groups (using k-means or similar).
 - At query time, only the most relevant clusters are searched.
 - ✅ Efficient for large datasets, reduces search scope.
 - ❌ Accuracy depends on clustering quality.
 
PQ (Product Quantization)
- Compresses vectors into compact codes to save memory.
 - ✅ Great for billion-scale datasets, reduces storage dramatically.
 - ❌ Some loss in accuracy due to compression.
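🔹 Example: IVF + PQ index (faiss)
IVF and PQ are often combined. A minimal sketch using the faiss library (the cluster count, code size, and random data are illustrative):

import numpy as np
import faiss

d, nb = 128, 100_000
xb = np.random.rand(nb, d).astype(np.float32)  # toy database vectors

nlist, m, nbits = 256, 16, 8      # 256 clusters; 16 sub-vectors encoded with 8 bits each
quantizer = faiss.IndexFlatL2(d)  # coarse quantizer used for IVF clustering
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)  # learn the clusters and PQ codebooks
index.add(xb)    # vectors are stored as compact PQ codes

index.nprobe = 8                 # search only the 8 most relevant clusters
D, I = index.search(xb[:1], 5)   # distances and ids of the 5 nearest neighbours
print(I, D)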
 
Annoy (Approximate Nearest Neighbors)
- Builds multiple random projection trees for searching.
 - ✅ Lightweight, simple to use, good for read-heavy static datasets.
 - ❌ Slower than HNSW at very large scale, poor for frequent updates.
 
ScaNN (Scalable Nearest Neighbors, Google)
- Optimized for high-dimensional, large-scale data.
 - ✅ Very fast, optimized for Google-scale workloads, low memory footprint.
 - ❌ Less community support, limited flexibility outside Google’s ecosystem.
 
🔹 Trade-offs
- Speed vs. Memory: Some indexes (like HNSW) are fast but memory-hungry.
 - Accuracy vs. Compression: PQ saves space but reduces precision.
 - Dynamic vs. Static Data: HNSW handles updates well, while Annoy is better for static data.
 
5️⃣ Filtering & Metadata
- Pure similarity search often returns results that are semantically close but not contextually relevant.
 - Real-world apps combine semantic similarity with structured filters (e.g., category, date, user ID).
 - Example: “find similar support tickets from the past month” or “recommend products in the Shoes category.”
 
🔹 How it works
- Each vector is stored with metadata → key-value pairs like:
  - category: "electronics"
  - created_at: "2025-09-01"
  - user_id: 12345
- At query time, the DB runs a hybrid search: 1) apply the metadata filters, 2) run similarity search only on the filtered subset.
 
🔹 Getting the query vector (two common options)
# Option 1: a hosted embedding model (OpenAI)
from openai import OpenAI
client = OpenAI()  # reads the OPENAI_API_KEY environment variable
q = "Wireless noise-cancelling headphones"
emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=q
).data[0].embedding  # length = 1536
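A second common option is a local model from the sentence-transformers library, e.g. all-MiniLM-L6-v2 (mentioned earlier). Note that it produces 384-dimensional vectors, so the index mapping's dims must match:

# Option 2: a local embedding model (sentence-transformers)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
emb = model.encode("Wireless noise-cancelling headphones").tolist()
print(len(emb))  # 384, so the dense_vector mapping must use dims = 384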
🔹 Example (Elasticsearch with vector + metadata filter)
POST /documents/_search
{
  "knn": {
    "field": "embedding",
    "query_vector": [/* paste emb here, e.g., 0.12, -0.87, 0.33, ... */],
    "k": 3,
    "num_candidates": 50,
    "filter": { "term": { "category": "electronics" } }
  }
}
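The same query can be sent from Python with the official elasticsearch client. A minimal sketch, assuming Elasticsearch 8.x running as a local unsecured dev cluster and the emb vector computed above:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local dev cluster without auth

resp = es.search(
    index="documents",
    knn={
        "field": "embedding",
        "query_vector": emb,  # the query embedding from the previous step
        "k": 3,
        "num_candidates": 50,
        "filter": {"term": {"category": "electronics"}},
    },
)

for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["content"])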