ML Analytics Suite

A suite of machine learning algorithms for clustering, dimensionality reduction, outlier detection, and embedding quality assessment, all in SQL.

Clustering Algorithms

K-Means Clustering

Lloyd's K-Means with k-means++ initialization, for finding customer segments, topic clusters, and other groupings in data.

-- CPU K-Means
SELECT cluster_kmeans(
    'customer_data',  -- table
    'features',       -- vector column
    5,                -- number of clusters
    100               -- max iterations
);

-- GPU K-Means (23x faster)
SELECT cluster_kmeans_gpu(
    'customer_data',
    'features',
    5,
    100
);

-- Get cluster assignments
SELECT id, cluster_id, centroid_distance
FROM neurondb_cluster_assignments('customer_data', 'features', 5)
ORDER BY cluster_id, centroid_distance
LIMIT 100;
Time Complexity: O(n·k·i·d), for n points, k clusters, i iterations, and d dimensions
Speedup on GPU: 23x
Initialization: k-means++
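To make the algorithm named above concrete, here is a minimal pure-Python sketch of Lloyd's iteration with k-means++ seeding. It is an illustration only and not NeuronDB's implementation: the function names (`kmeans`, `kmeans_pp_init`, `sq_dist`) and the sample data are hypothetical, and it assumes the input has at least k distinct points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: the first centroid is picked uniformly at random,
    each later one with probability proportional to its squared distance
    from the nearest centroid chosen so far.
    Assumes at least k distinct points (otherwise all weights can be zero)."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update until the
    assignments stop changing or max_iter is reached."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged
        assignments = new_assignments
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


if __name__ == "__main__":
    # Hypothetical 2-D data with two well-separated groups.
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centroids, labels = kmeans(data, k=2)
    print(centroids, labels)
```

Each iteration compares every one of the n points with all k centroids across d dimensions, which is where the O(n·k·i·d) cost over i iterations comes from.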

DBSCAN (Density-Based)

Density-based clustering that automatically discovers the number of clusters and identifies outliers.

-- DBSCAN clustering (auto-discovers cluster count)
SELECT cluster_dbscan(
    'customer_data',
    'features',
    0.5,  -- epsilon (neighborhood radius)
    5     -- min_points (minimum cluster size)
);

-- Get clusters and outliers
SELECT cluster_id, COUNT(*) as size
FROM neurondb_dbscan_assignments('customer_data', 'features', 0.5, 5)
GROUP BY cluster_id
ORDER BY cluster_id;
-- cluster_id = -1 means outlier
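The following pure-Python sketch shows how density-based clustering can discover the cluster count and label outliers at the same time. It follows the textbook DBSCAN procedure and is not NeuronDB's implementation. The names `dbscan` and `region_query` are hypothetical, a point is assumed to count as its own neighbor, and numbering clusters from 0 is a choice made here; only the -1 outlier label comes from the example above.

```python
import math
from collections import deque

NOISE = -1  # same outlier label as in the SQL example


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_points):
    """Return one label per point: 0, 1, 2, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_points:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        # points[i] is a core point: start a new cluster and grow it.
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # j is also core, keep expanding
        cluster_id += 1
    return labels


if __name__ == "__main__":
    # Hypothetical data: two dense groups and one isolated point.
    group = [(0.0, 0.0), (0.0, 0.3), (0.3, 0.0), (0.3, 0.3), (0.15, 0.15)]
    data = group + [(x + 5.0, y + 5.0) for x, y in group] + [(20.0, 20.0)]
    print(dbscan(data, eps=0.5, min_points=5))
```

No cluster count is passed in: the number of clusters falls out of how many separate dense regions exist for the chosen epsilon and min_points.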

Dimensionality Reduction

PCA (Principal Component Analysis)

Reduce high-dimensional vectors to fewer dimensions while preserving as much of their variance as possible.

-- Reduce dimensions: 768 → 128
SELECT reduce_dimensionality_pca(
    'embeddings_table',
    'vector_column',
    128  -- target dimensions
);

-- Returns: {"components": 128,
--           "explained_variance": [0.45, 0.23, 0.12, ...],
--           "total_variance_explained": 0.80}
-- 80% of information retained with 83% size reduction
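As a rough illustration of what "preserving variance" means, the sketch below finds the first principal component of a tiny dataset by power iteration on its covariance matrix, then reports the fraction of variance that component explains. It is a teaching sketch under stated assumptions, not NeuronDB's method: the function names and sample rows are hypothetical, and a full PCA would extract further components as well.

```python
import math


def mean_center(rows):
    """Subtract the per-column mean from every row."""
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (d x d) of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]


def top_component(cov, iters=200):
    """Power iteration: repeatedly apply cov and renormalize. Converges to the
    eigenvector with the largest eigenvalue, assuming the start vector is not
    orthogonal to it."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T cov v gives the matching eigenvalue (variance).
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
    return v, eigenvalue


def pca_first_component(rows):
    """Return (component, explained_variance_ratio, 1-D projections)."""
    centered = mean_center(rows)
    cov = covariance(centered)
    component, eigenvalue = top_component(cov)
    total_variance = sum(cov[j][j] for j in range(len(cov)))
    projected = [sum(x * c for x, c in zip(r, component)) for r in centered]
    return component, eigenvalue / total_variance, projected


if __name__ == "__main__":
    # Hypothetical 2-D rows lying exactly on the line y = 2x.
    rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
    component, ratio, projected = pca_first_component(rows)
    print(component, ratio, projected)

    # Storage saved by going from 768 to 128 dimensions, as in the SQL example.
    size_reduction = 1 - 128 / 768
    print(round(size_reduction, 2))  # 0.83
```

Because the sample rows lie on a single line, one component captures all of the variance. Real embeddings spread variance across many directions, which is why the example output above lists a separate explained-variance value per component.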

Outlier Detection

Isolation Forest

Detect anomalies and unusual patterns in your vector data using the Isolation Forest algorithm.

-- Detect outliers with 95% confidence
SELECT detect_outliers(
    'customer_data',
    'features',
    0.95  -- confidence level
) AS outlier_count;

-- Get outlier details
SELECT id, anomaly_score
FROM neurondb_outlier_scores('customer_data', 'features', 0.95)
WHERE is_outlier = true
ORDER BY anomaly_score DESC;
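The idea behind an anomaly score can be sketched in a few lines of pure Python. Isolation Forest builds random trees that split the data on a random dimension at a random value, and points that end up alone after only a few splits score close to 1. The code below is a simplified sketch of the published algorithm, not NeuronDB's implementation. All names are hypothetical, it assumes at least two input rows, and how the 0.95 confidence level in the SQL above maps onto a score threshold is not specified by the source, so no threshold is applied here.

```python
import math
import random


def c_factor(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over
    n points; used to normalize path lengths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random dimension at a random value."""
    if depth >= max_depth or len(rows) <= 1:
        return ("leaf", len(rows))
    dim = rng.randrange(len(rows[0]))
    lo = min(r[dim] for r in rows)
    hi = max(r[dim] for r in rows)
    if lo == hi:  # simplification: stop if the chosen dimension is constant
        return ("leaf", len(rows))
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[dim] < split]
    right = [r for r in rows if r[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(tree, row, depth=0):
    """Number of splits needed to reach row's leaf, plus a correction for
    leaves that still hold several points."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, dim, split, left, right = tree
    return path_length(left if row[dim] < split else right, row, depth + 1)


def anomaly_scores(rows, n_trees=100, sample_size=256, seed=0):
    """Score in (0, 1] per row; a shorter average path gives a higher score."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))  # assumes len(rows) >= 2
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(rows, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(sample_size)
    return [2.0 ** (-sum(path_length(t, r) for t in trees) / (n_trees * norm))
            for r in rows]


if __name__ == "__main__":
    # Hypothetical data: a tight 6 x 5 grid of points plus one far-away point.
    data = [(x * 0.1, y * 0.1) for x in range(6) for y in range(5)]
    data.append((50.0, 50.0))
    scores = anomaly_scores(data)
    print(max(range(len(data)), key=lambda i: scores[i]))  # index of top score
```

Sorting rows by this score in descending order mirrors the `ORDER BY anomaly_score DESC` query above: the most easily isolated rows come first.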