
Understanding Embeddings

Learn what embeddings are, why they matter, and how to use them effectively in NeurondB

What Are Embeddings?

Embeddings are dense vector representations of data (text, images, audio) that capture semantic meaning in a high-dimensional space. Unlike traditional keyword-based representations, embeddings encode contextual relationships, allowing machines to understand similarity and meaning.

Key Concept

Traditional Search: Matches exact keywords → "machine learning" only finds documents with those exact words

Semantic Search (Embeddings): Understands meaning → "machine learning" also finds "neural networks", "AI models", "deep learning"

How Embeddings Capture Similarity:

Text "cat"    → [0.8, 0.2, 0.1, ...]     ┐
Text "kitten" → [0.75, 0.25, 0.12, ...]   ├─ Close together (similar meaning)
Text "dog"    → [0.7, 0.3, 0.15, ...]     ┘

Text "car"    → [-0.3, 0.9, -0.5, ...]    ← Far apart (different concept)

Why Embeddings Matter

Understanding Context

Embeddings capture context as well as meaning: the word "bank" yields a different vector when it appears near "river" than when it appears near "money", because the surrounding words shape the embedding.
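
A quick way to see this, sketched with the same embed_text() function (the exact distances depend on the model):

-- "bank" shifts meaning with its surrounding words
SELECT embed_text('river bank') <-> embed_text('flowing water')   AS riverside,
       embed_text('river bank') <-> embed_text('savings account') AS financial;
-- riverside is typically the smaller of the two distances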

Language Independence

With a multilingual model, similar concepts in different languages map to similar embeddings: search in English and find results in Spanish, French, or Chinese.
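
For example, assuming a multilingual model is available (the model name below is illustrative, not one NeurondB necessarily ships):

-- English and Spanish phrases for the same concept land close together
SELECT embed_text('machine learning', 'paraphrase-multilingual-MiniLM-L12-v2')
   <-> embed_text('aprendizaje automático', 'paraphrase-multilingual-MiniLM-L12-v2')
       AS en_es_distance;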

Multimodal Capabilities

Text, images, and audio can be embedded in the same space. Search for images using text descriptions!

Text Embeddings

Basic Usage

-- Generate embedding from text
SELECT embed_text('artificial intelligence');

-- Result: vector(384) containing the embedding
-- [0.234, -0.891, 0.456, ..., 0.123]

Choose a Model

-- Fast, efficient model (384 dimensions)
SELECT embed_text('machine learning', 'all-MiniLM-L6-v2');

-- Higher quality (768 dimensions)
SELECT embed_text('machine learning', 'all-mpnet-base-v2');

Batch Processing

Process multiple texts efficiently (3-5x faster than individual calls):

-- Embed multiple texts at once
SELECT embed_text_batch(
    ARRAY[
        'artificial intelligence',
        'machine learning',
        'deep learning'
    ],
    'all-MiniLM-L6-v2'
);
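
To pair batch results with their inputs in one query, you can zip the two arrays with PostgreSQL's multi-array unnest(). This sketch assumes embed_text_batch() returns an array of vectors aligned with its input array:

-- Pair each input text with its embedding
WITH input(texts) AS (
    VALUES (ARRAY['artificial intelligence', 'machine learning', 'deep learning'])
)
SELECT pair.txt AS content, pair.emb AS embedding
FROM input,
     unnest(input.texts,
            embed_text_batch(input.texts, 'all-MiniLM-L6-v2')) AS pair(txt, emb);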

Image Embeddings

Convert visual information into vectors for reverse image search, classification, and multimodal search.

-- Embed image from binary data
SELECT embed_image(
    pg_read_binary_file('/path/to/image.jpg'),
    'clip'  -- CLIP model (text + image)
);

-- Search images using text: embed the query with the same CLIP model
-- so the text and image vectors share one embedding space
SELECT filename,
       embedding <-> embed_text('sunset beach', 'clip') AS distance
FROM images
ORDER BY distance
LIMIT 10;

🎨 Multimodal Magic: With CLIP embeddings, you can search for images using text descriptions and vice versa!
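
The reverse direction looks the same. A sketch assuming a hypothetical captions table whose embedding column holds CLIP text embeddings:

-- Find captions that describe a given image
SELECT caption,
       embedding <-> embed_image(
           pg_read_binary_file('/path/to/photo.jpg'), 'clip'
       ) AS distance
FROM captions
ORDER BY distance
LIMIT 5;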

Complete Example: Document Search

1. Create Table

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT,
    content TEXT,
    embedding vector(384)
);

2. Insert Documents with Embeddings

INSERT INTO documents (title, content, embedding) VALUES
    ('Machine Learning', 'Introduction to ML...', 
     embed_text('Introduction to ML...')),
    ('Deep Learning', 'Neural networks...', 
     embed_text('Neural networks...'));

3. Semantic Search

SELECT title, content,
       embedding <-> embed_text('AI algorithms') AS distance
FROM documents
ORDER BY distance
LIMIT 5;

Performance Tips

  • Batch processing: Use embed_text_batch() for a 3-5x speedup
  • Caching: Use embed_cached() to avoid regenerating embeddings (see the sketch below)
  • Indexing: Create HNSW indexes on embedding columns for fast search (see the sketch below)
  • Model selection: Smaller models (384-dim) are faster; larger models (768-dim) are more accurate
  • Quantization: Use int8 or binary types for 4-32x storage savings
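
A hedged sketch of the indexing and caching tips above: the HNSW syntax follows the pgvector convention, and embed_cached() is assumed to take the same arguments as embed_text():

-- HNSW index for fast approximate nearest-neighbor search
-- (vector_l2_ops matches the <-> distance used in this guide)
CREATE INDEX documents_embedding_idx
    ON documents USING hnsw (embedding vector_l2_ops);

-- Cached embedding: repeated calls with the same text reuse the stored vector
SELECT embed_cached('artificial intelligence', 'all-MiniLM-L6-v2');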