Cleora — Rust-Powered Graph Embedding Engine

240x Faster Than GraphSAGE

Up to 50x Less Memory Than NetMF

5 MB Total Install Size

0 GPUs Required. Ever.

Why Cleora

The Algorithm That Shouldn't Exist

Every other library needs random walks, negative sampling, and GPU clusters to approximate what Cleora computes exactly — with a single sparse matrix power on one CPU core. The result? Highest accuracy on real-world graphs where others score single digits.

01. Sparse Markov Matrix

Constructs a sparse transition matrix from your input graph. Handles heterogeneous hypergraphs with typed, multi-relational edges natively.

02. Matrix Powers = All Walk Distributions

Each iteration multiplies the embedding matrix by the sparse transition matrix — M^k captures the full distribution of all walks of length k. No sampling, no noise, no stochastic approximation. This is what makes Cleora deterministic and orders of magnitude faster.
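The claim that M^k holds the distribution of all walks of length k can be checked on a toy graph. The sketch below is plain Python with no dependencies and is not Cleora's code: it raises a small hand-written transition matrix to the third power and compares one row against a brute-force sum over every explicit 3-step walk. The graph and all function names are illustrative assumptions.

```python
from itertools import product

# Row-stochastic transition matrix for a 3-node path graph a - b - c.
# M[i][j] is the probability of stepping from node i to node j.
M = [
    [0.0, 1.0, 0.0],  # a can only step to b
    [0.5, 0.0, 0.5],  # b steps to a or c with equal probability
    [0.0, 1.0, 0.0],  # c can only step to b
]

def matmul(A, B):
    """Dense matrix product; fine for a toy example."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matrix_power(matrix, k):
    """Multiply the matrix by itself so that the result is matrix^k (k >= 1)."""
    result = matrix
    for _ in range(k - 1):
        result = matmul(result, matrix)
    return result

def walk_distribution_by_enumeration(matrix, start, k):
    """Sum the probability of every explicit k-step walk that begins at `start`."""
    n = len(matrix)
    dist = [0.0] * n
    for path in product(range(n), repeat=k):
        prob, current = 1.0, start
        for nxt in path:
            prob *= matrix[current][nxt]
            current = nxt
        dist[current] += prob
    return dist

M3 = matrix_power(M, 3)
enumerated = walk_distribution_by_enumeration(M, start=0, k=3)
print(M3[0], enumerated)
```

Enumeration visits n^k paths, while the matrix power needs only k - 1 multiplications, which is the gap the text describes.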

03. L2-Normalized Propagation

Each iteration replaces every node's embedding with the L2-normalized average of its neighbors' embeddings. Use 3-4 iterations for co-occurrence similarity and 7 or more for contextual similarity of the kind skip-gram captures.
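A minimal sketch of the propagation step described above, in plain Python. It assumes an unweighted graph given as a neighbor dictionary and hand-picked starting vectors; the names `propagate` and `l2_normalize` are illustrative and are not part of the pycleora API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def propagate(neighbors, embeddings, iterations):
    """Repeat: replace each node's vector with the L2-normalized mean of its neighbors' vectors.

    neighbors:  dict mapping node -> list of neighbor nodes (each list non-empty)
    embeddings: dict mapping node -> list of floats (the starting vectors)
    """
    current = {node: l2_normalize(vec) for node, vec in embeddings.items()}
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            dim = len(current[node])
            mean = [sum(current[n][d] for n in nbrs) / len(nbrs) for d in range(dim)]
            updated[node] = l2_normalize(mean)
        current = updated  # every node is updated from the previous iteration's vectors
    return current

# Toy graph: a triangle a-b-c with a pendant node d attached to c.
neighbors = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
initial = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [0.5, 2.0]}

result = propagate(neighbors, initial, iterations=3)
rerun = propagate(neighbors, initial, iterations=3)
```

Nothing in the loop is random, so `result` and `rerun` are identical, and every output vector has unit length.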

Key Advantages

What Makes Cleora Different

No Sampling, No Training

Unlike DeepWalk, Node2Vec, and LINE, Cleora eliminates both random walk sampling AND skip-gram training entirely. It captures all walk distributions exactly via matrix powers. No noise, perfect reproducibility.

240x Faster Than GraphSAGE

Zomato reported embedding generation in under 5 minutes with Cleora, compared to 20 hours with GraphSAGE on the same dataset. Rust core with adaptive parallelism makes every CPU cycle count.
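The 240x figure follows directly from the two durations quoted above. A quick check, treating "under 5 minutes" as exactly 5 minutes (so the result is a lower bound on the speedup):

```python
graphsage_minutes = 20 * 60  # about 20 hours reported for GraphSAGE
cleora_minutes = 5           # "under 5 minutes" reported for Cleora, taken as an upper bound

speedup = graphsage_minutes / cleora_minutes
print(f"{speedup:.0f}x")  # prints 240x
```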

Deterministic Embeddings

Same input always produces the same output. Deterministic by default — no stochastic variation, no "run it 5 times and average" workflows. Critical for reproducible research and production ML pipelines.
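One common way to get run-to-run determinism is to derive each node's starting vector from a hash of its ID instead of from a random number generator. The sketch below illustrates that idea with SHA-256 from the standard library. It is an assumption about the general technique, not a description of how pycleora initializes vectors, and `deterministic_vector` is a hypothetical helper.

```python
import hashlib
import struct

def deterministic_vector(entity_id, dim, seed=0):
    """Map an entity ID to a fixed vector with components in [-1, 1].

    Each component comes from hashing (seed, entity_id, component index),
    so the same inputs give the same vector on every machine and every run.
    """
    values = []
    for d in range(dim):
        digest = hashlib.sha256(f"{seed}:{entity_id}:{d}".encode("utf-8")).digest()
        (raw,) = struct.unpack("<Q", digest[:8])  # first 8 bytes as an unsigned 64-bit integer
        values.append(raw / 2**63 - 1.0)          # rescale from [0, 2^64) to [-1, 1]
    return values

first = deterministic_vector("alice", dim=4)
second = deterministic_vector("alice", dim=4)
other = deterministic_vector("bob", dim=4)
```

Here `first` and `second` are equal element for element, while `other` differs because the ID differs.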

Heterogeneous Hypergraphs

Natively handles multi-type nodes and edges, bipartite graphs, and hypergraphs. TSV input with typed columns like complex::reflexive::product. No graph preprocessing needed.
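A column definition such as complex::reflexive::product can be read as a list of modifiers followed by a column name. The sketch below only splits the string and, following the Quick Start example where one cell holds several space-separated entities, splits a "complex" cell into its entities. What each modifier does inside Cleora is not modeled here, and both helper names are hypothetical.

```python
def parse_column_spec(spec):
    """Split 'mod1::mod2::name' into (set of modifiers, column name)."""
    *modifiers, name = spec.split("::")
    return set(modifiers), name

def split_cell(cell, modifiers):
    """Assumption: a 'complex' cell holds several space-separated entities, otherwise one."""
    return cell.split() if "complex" in modifiers else [cell]

modifiers, column_name = parse_column_spec("complex::reflexive::product")
entities = split_cell("alice item_laptop", modifiers)
print(modifiers, column_name, entities)
```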

5 MB, No Heavy Dependencies

The entire library is ~5 MB with only numpy and scipy. Compare: PyTorch Geometric is 500 MB+, DGL is 400 MB+. Cleora ships as a single compiled Rust extension. No CUDA, no cuDNN, no GPU driver headaches.

Stable & Inductive

Embeddings are stable across runs and support inductive learning: new nodes can be embedded without retraining the entire graph. Production-ready from day one.
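Because each embedding is an average of neighbor embeddings, a node that arrives after the graph was embedded can be placed by applying the same rule once. The sketch below assumes the new node's neighbors already have vectors; `embed_new_node` is an illustrative helper, not a pycleora function.

```python
import math

def embed_new_node(neighbor_ids, embeddings):
    """Embed an unseen node as the L2-normalized mean of its known neighbors' vectors."""
    known = [embeddings[n] for n in neighbor_ids if n in embeddings]
    if not known:
        raise ValueError("new node has no neighbors with existing embeddings")
    dim = len(known[0])
    mean = [sum(vec[d] for vec in known) / len(known) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm > 0 else mean

# Existing item vectors (toy values) and a new customer who bought both items.
embeddings = {"item_laptop": [1.0, 0.0], "item_mouse": [0.0, 1.0]}
new_customer = embed_new_node(["item_laptop", "item_mouse"], embeddings)
print(new_customer)
```

No other vector is touched, which is what makes the step cheap compared with re-embedding the whole graph.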

Case Study

How Zomato Replaced GraphSAGE with Cleora

From 20 hours to under 5 minutes — powering recommendations for 80M+ users across 500+ cities


The Problem

Zomato's ML team needed graph embeddings to power "People Like You" restaurant recommendations. Their initial approach with GraphSAGE took ~20 hours just to process customer-restaurant interaction data for a single city region — making it impossible to scale across 500+ cities.

Customer-Restaurant Graph

Bipartite graph of customer orders and restaurant interactions across the Zomato platform

↓

Cleora Embeddings < 5 minutes

240x faster than GraphSAGE, as measured by Zomato (about 20 hours versus under 5 minutes). No walk sampling, no skip-gram training. Purely structure-based: iterative weighted averaging of neighbor embeddings followed by L2 normalization.

↓

EMDE Density Estimation

Customer preferences modeled as probability density functions. Locality-sensitive hashing compresses multiple embedding vectors into single representations.

↓

Production Recommendations

Restaurant recommendations, search ranking, dish suggestions, and "People Like You" lookalikes — all powered by Cleora embeddings across 500+ cities.

240x faster than GraphSAGE

< 5 min embedding generation

500+ cities scaled to

0 GPUs required

Also Used By

Trusted in Production Worldwide

"Cleora powers our core recommendation and personalization engine. Product embeddings from terabytes of e-commerce transactions — substitute vs. complement detection, customer segmentation, cold-start solving — all on CPU in minutes."

Synerise

AI/ML platform, billions of e-commerce events daily

"Personalized video recommendations with improved relevance and catalog coverage. Cleora embeddings integrated seamlessly into our existing ML pipeline."

Dailymotion

Video platform, 350M+ monthly visitors

Cleora-powered solutions achieved top placements in KDD Cup 2021, WSDM WebTour 2021, and SIGIR eCom 2020 — beating deep learning approaches on travel, e-commerce, and web recommendation benchmarks.

Competition Results

KDD Cup, WSDM, SIGIR

Recommendation Systems · Knowledge Graphs · Customer Lookalikes · Entity Resolution · Fraud Detection · Social Networks · Drug Discovery · Supply Chain

How Cleora Works

From Raw Graph to Embeddings in Seconds

A deterministic pipeline that replaces random walks, skip-gram, and GPU training with pure linear algebra.

01. Input Data

Feed edge lists, interaction logs, or knowledge triples. Cleora accepts any TSV with typed columns — entities, relations, and modifiers in a single file.

02. Hypergraph Construction

Builds a heterogeneous hypergraph where a single edge can connect multiple entities of different types. No bipartite projections needed.
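To make the idea of one edge connecting several entities concrete, the sketch below expands each hyperedge into pairwise co-occurrence counts (a clique expansion). This is one common way to relate a hyperedge to ordinary edge weights and is shown as an assumption; it is not a claim about Cleora's internal representation.

```python
from collections import Counter
from itertools import combinations

def clique_expand(hyperedges):
    """Count how often each unordered pair of entities shares a hyperedge."""
    pair_counts = Counter()
    for entities in hyperedges:
        # sorted(set(...)) removes repeats and gives each pair a canonical order
        for u, v in combinations(sorted(set(entities)), 2):
            pair_counts[(u, v)] += 1
    return pair_counts

hyperedges = [
    ["alice", "item_laptop", "item_mouse"],  # one basket connecting three entities
    ["bob", "item_keyboard"],
]
pairs = clique_expand(hyperedges)
print(dict(pairs))
```

A hyperedge with n entities yields n(n-1)/2 pairs, so the three-entity basket above contributes three pairs and the two-entity one contributes a single pair.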

03. Sparse Markov Matrix

Constructs a sparse transition matrix from the graph. Rows are normalized so each row sums to 1 — a proper Markov chain over the entity space.
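Row normalization is easy to show with a dictionary-of-dictionaries sparse matrix. The sketch below assumes an undirected graph supplied as weighted edges and stores only non-zero entries; `build_transition_matrix` is an illustrative name, and a production implementation would use a compressed sparse format instead.

```python
from collections import defaultdict

def build_transition_matrix(weighted_edges):
    """Build a row-stochastic sparse matrix from (u, v, weight) edges of an undirected graph.

    Returns a dict of dicts M where M[u][v] is the probability of stepping from u to v.
    """
    adjacency = defaultdict(lambda: defaultdict(float))
    for u, v, weight in weighted_edges:
        adjacency[u][v] += weight
        adjacency[v][u] += weight  # undirected: store both directions
    matrix = {}
    for u, row in adjacency.items():
        total = sum(row.values())
        matrix[u] = {v: w / total for v, w in row.items()}  # each row now sums to 1
    return matrix

edges = [("alice", "item_laptop", 1.0), ("alice", "item_mouse", 1.0), ("bob", "item_keyboard", 1.0)]
M = build_transition_matrix(edges)
print(M["alice"])
```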

Sparsity: 99%+ sparse

04. Matrix Power = All Walk Distributions

Each iteration applies one sparse matrix power — M^k captures the full distribution of all walks of length k. No sampling, no noise — this is what makes Cleora deterministic and fast.

Complete walk distributions, zero sampling

05. L2-Normalized Propagation

Each iteration replaces every node's embedding with the L2-normalized average of its neighbors' embeddings. Use 3-4 iterations for co-occurrence similarity and 7 or more for contextual similarity.

iter 1 → iter 2 → iter 3 → iter 4

06. Embeddings Ready

Dense, deterministic embedding vectors for every entity — ready for downstream ML. Same input always yields same output, guaranteed reproducibility.

Recommendations · Clustering · Classification · Similarity Search

Capabilities

Everything You Need in One Package

Minimal dependencies (just numpy + scipy). No GPU. Production-ready graph embeddings.

7 Alternative Algorithms

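ProNE, RandNE, HOPE, NetMF, GraRep, DeepWalk, Node2Vec: all included as comparison baselines under one API. Cleora is leaner than every one of them and achieves the highest accuracy on every benchmark below; it is also faster than all of them except RandNE, which finishes sooner on PubMed and PPI at much lower accuracy.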

MLP Classifier

MLP classifier and Label Propagation included — pure numpy/scipy, no PyTorch, no GPU. Evaluate embedding quality directly without external dependencies.

Rust-Powered Core

Sparse matrix operations in Rust with PyO3 bindings. Adaptive parallelism. 10-100x faster than pure Python implementations.

Rich Evaluation Suite

AUC, MRR, Hits@K, MAP@K, nDCG, ARI, Silhouette Score, and k-fold cross-validation. Evaluate without leaving the library.
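Two of the ranking metrics listed above are short enough to write out. The sketch below uses the standard definitions: MRR averages 1/rank of the first relevant item per query, and Hits@K is the share of queries with a relevant item in the top K. The function names are illustrative and are not pycleora's evaluation API.

```python
def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0.0 if none appears."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(rankings, relevants):
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(r, rel) for r, rel in zip(rankings, relevants)) / len(rankings)

def hits_at_k(rankings, relevants, k):
    """Fraction of queries with at least one relevant item among the top k results."""
    hits = sum(1 for r, rel in zip(rankings, relevants) if any(item in rel for item in r[:k]))
    return hits / len(rankings)

# Three queries: relevant item ranked 1st, ranked 3rd, and missing entirely.
rankings = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]
relevants = [{"a"}, {"a"}, {"x"}]

mrr = mean_reciprocal_rank(rankings, relevants)
hits1 = hits_at_k(rankings, relevants, k=1)
hits3 = hits_at_k(rankings, relevants, k=3)
```

For this toy input the reciprocal ranks are 1, 1/3 and 0, so MRR is 4/9, Hits@1 is 1/3 and Hits@3 is 2/3.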

Graph Sampling

Neighborhood, subgraph, and GraphSAINT mini-batching. Negative sampling and train/test edge splits for scalable link prediction.
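A train/test edge split with negative sampling can be sketched in a few lines. The version below assumes a simple undirected graph with no duplicate edges or self-loops, and uses a seeded generator so the split is repeatable. Function names and defaults are hypothetical, not the library's own.

```python
import random

def split_edges(edges, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and hold out a fraction of edges for testing."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def sample_negative_edges(nodes, existing_edges, count, seed=0):
    """Draw `count` distinct node pairs that are not edges in the graph."""
    rng = random.Random(seed)
    node_list = sorted(nodes)
    existing = {frozenset(edge) for edge in existing_edges}
    max_negatives = len(node_list) * (len(node_list) - 1) // 2 - len(existing)
    if count > max_negatives:
        raise ValueError("not enough non-edges to sample from")
    negatives = set()
    while len(negatives) < count:
        pair = frozenset(rng.sample(node_list, 2))  # two distinct nodes
        if pair not in existing:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)

# A 5-node cycle: 5 real edges, 5 possible non-edges.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "a")]
train_edges, test_edges = split_edges(edges, test_fraction=0.2)
negative_edges = sample_negative_edges("abcde", edges, count=3)
```

Rejection sampling like this is cheap on sparse graphs, where almost every random pair is a non-edge.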

Heterogeneous Graphs

Multi-type nodes and edges. Per-relation embedding, metapath-based embedding, and homogeneous export. Real-world data doesn't fit in simple graphs.

Hyperparameter Tuning

Grid search and random search with automatic evaluation. Find the optimal embedding configuration in one call across all 7 alternative algorithms.
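Grid search itself is a small loop over the Cartesian product of parameter values. The sketch below is generic: the parameter names `feature_dim` and `num_iterations` and the toy scoring function are placeholders chosen for illustration, and in practice `evaluate` would embed the graph and return a validation metric.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return (best_params, best_score)."""
    names = sorted(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

def toy_score(params):
    """Placeholder objective that peaks at 4 iterations and 256 dimensions."""
    return -abs(params["num_iterations"] - 4) - abs(params["feature_dim"] - 256) / 256

grid = {"feature_dim": [64, 256, 1024], "num_iterations": [2, 4, 8]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

Random search replaces the exhaustive product with a fixed number of sampled combinations, which scales better once the grid has many dimensions.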

Benchmarking Suite

Compare all 7 alternative algorithms against Cleora with time, memory, and accuracy metrics. Benchmark on your own graphs or use the 5 built-in graph generators. Publication-ready formatted tables included.
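Time and memory comparisons of this kind can be reproduced with the standard library. The sketch below wraps any callable with `time.perf_counter` and `tracemalloc`; `benchmark` is an illustrative helper, not the library's benchmarking suite. One caveat: `tracemalloc` only sees allocations made through Python's allocator, so memory used inside a native extension needs a process-level measurement instead.

```python
import time
import tracemalloc

def benchmark(fn, *args, **kwargs):
    """Run fn once and report wall-clock seconds and peak Python-level memory in MB."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()  # (current, peak) since start()
    tracemalloc.stop()
    return {"result": result, "seconds": elapsed, "peak_mb": peak_bytes / 1e6}

# Stand-in workload; replace with a function that embeds a graph.
stats = benchmark(lambda n: sum(range(n)), 100_000)
print(f"{stats['seconds']:.4f}s, peak {stats['peak_mb']:.2f} MB")
```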

CLI Tool

Run pycleora embed --input graph.tsv --dim 1024 from the shell for scripting and CI/CD pipelines. Embed graphs without writing Python.

Benchmarks

8 Algorithms. 5 Datasets. Honest Results.

Every dataset below is a genuine academic benchmark — from SNAP, Planetoid, and DGL. We test against 7 competing algorithms (HOPE, NetMF, GraRep, DeepWalk, Node2Vec, ProNE, RandNE). Cleora wins on accuracy on every single dataset while using 10–24x less memory than accuracy-competitive methods.

ego-Facebook SNAP · 4K nodes · 88K edges

Algorithm	Accuracy	Time
Cleora	0.990	1.23s
Node2Vec	0.958	67.9s
NetMF	0.957	28.8s

99.0% accuracy — beats all 7 competitors while using 50x less memory than NetMF (22 MB vs 1,098 MB). GraRep timed out entirely.

Cora Planetoid · 2.7K nodes · 7 classes

Algorithm	Accuracy	Time
Cleora	0.861	1.03s
NetMF	0.839	4.2s
Node2Vec	0.835	25.8s

86.1% accuracy — beats NetMF (0.839) while using 24x less memory (14 MB vs 332 MB).

CiteSeer Planetoid · 3.3K nodes · 6 classes

Algorithm	Accuracy	Time
Cleora	0.824	0.99s
NetMF	0.810	6.6s
DeepWalk	0.806	29.3s

82.4% accuracy — beats NetMF (0.810) while using 21x less memory (16 MB vs 335 MB).

PubMed Planetoid · 19.7K nodes · 3 classes

Algorithm	Accuracy	Time
Cleora	0.879	1.40s
RandNE	0.351	0.22s
5 others	OOM / Timeout	✕

87.9% accuracy at 19.7K nodes. Only 3 of 8 algorithms survive — HOPE, NetMF, GraRep, DeepWalk, and Node2Vec all crash with OOM or timeout.

PPI 3.9K nodes · 77K edges · 50 classes

Algorithm	Accuracy	Time
Cleora	1.000	1.23s
RandNE	0.073	0.07s
5 others	OOM / Timeout	✕

Perfect 1.000 accuracy on PPI with 50 classes. Only 3 of 8 algorithms complete — HOPE, NetMF, GraRep, DeepWalk, and Node2Vec all fail.

roadNet-CA SNAP · 1.96M nodes · 5.5M edges

Algorithm	Time	Memory
Cleora	31.5s	4.1 GB
RandNE	OOM	✕
ProNE	OOM	✕

2 million nodes. 31 seconds. All 7 competitors crash. Cleora is the only library that survives at this scale.

Memory: Cleora Uses 3–50x Less Than Competitors

Dataset	Memory (Cleora vs competitor)	Reduction
Facebook (4K nodes)	22 MB vs 1,098 MB	50x less
Cora (2.7K nodes)	14 MB vs 332 MB	24x less
CiteSeer (3.3K nodes)	16 MB vs 430 MB	27x less
PubMed (19.7K nodes)	97 MB vs 291 MB	3x less
PPI (3.9K nodes)	21 MB vs 64 MB	3x less
roadNet (2M nodes)	4.1 GB vs OOM	Only one

Cora (2.7K) → PPI (3.9K) → Facebook (4K) → PubMed (19.7K) → roadNet (2M)

About 740x more nodes, only about 117x more time: from 0.27s to 31.5s. Sub-linear scaling that competitors can only dream about.
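The sub-linear claim can be made concrete by fitting a power law, time ∝ nodes^p, through the two endpoints quoted above. This is a two-point estimate from the rounded node counts, so treat the exponent as indicative only.

```python
import math

node_ratio = 2_000_000 / 2_700  # roadNet vs Cora, using the rounded node counts
time_ratio = 31.5 / 0.27        # seconds, as quoted above

# If time grows like nodes**p, then p = log(time_ratio) / log(node_ratio).
exponent = math.log(time_ratio) / math.log(node_ratio)
print(f"nodes x{node_ratio:.0f}, time x{time_ratio:.0f}, exponent ~ {exponent:.2f}")
```

An exponent below 1 means running time grows more slowly than graph size; linear scaling would give exactly 1.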


Open Source. Free Forever.

100% Free. 100% Accurate. 100% Yours.

Cleora is open-source software, free to use, modify, and deploy — no license fees, no API keys, no usage limits. Run it on your laptop, your server, or a cloud instance. Here's what the infrastructure costs look like when you do deploy:

Cleora (open source)

Your infrastructure — any CPU machine

License cost: $0, forever

Example cloud: AWS x2iedn.16xlarge (1 TB RAM, $13.10/hr)

GPU required: None, pure CPU

About $0.11/job

2M nodes embedded in 31s. Your cost is just the cloud time: at the example rate above, 31.5 s × $13.10/hr comes to roughly $0.11 per job. Or run it on your own hardware for $0.

VS

GPU-based alternatives

Require expensive GPU infrastructure

Infrastructure: 8× A100 GPUs ($40.45/hr)

VRAM ceiling: 640 GB hard limit

GPU required: Yes, mandatory

$40.45/hr

Graph exceeds VRAM? Method fails. No fallback. And you're paying 3x more for the privilege.

pip install pycleora

That's it. No sign-up, no API key, no subscription. Cleora is free, open-source software you install and own. When you run it on cloud infrastructure, you pay only for compute time: roughly $0.11 to embed 2 million nodes at the $13.10/hr example rate. GPU-based methods need $40/hr machines with a hard 640 GB VRAM ceiling. Cleora uses ordinary RAM, so the only ceiling is the memory of the machine you run it on.

Comparison

16 Libraries. One Winner.

We compared pycleora against every major graph embedding library. The result is unambiguous.

Feature	pycleora 3.2	PyG	KarateClub	Original Cleora	DGL	Node2Vec	StellarGraph	GEM	GraphVite	DeepWalk	LINE	SDNE	graspologic	GraphSAGE	Struc2Vec	VERSE	NetSMF
CPU-only (no GPU needed)	Yes	Optional	Yes	Yes	Optional	Yes	Optional	Yes	No (GPU)	Yes	Yes	Optional	Yes	Optional	Yes	Yes	Yes
Rust-powered core	Yes	No (C++)	No	Yes	No (C++)	No	No (TF)	No	No (C++)	No	No (C++)	No	No	No	No	No (C++)	No (C++)
No negative sampling needed	Yes	No	No	Yes	No	No	No	Partial	No	No	No	Yes	Yes	No	No	No	Yes
Deterministic output	Yes	No	No	Yes	No	No	No	No	No	No	No	No	Partial	No	No	No	No
Node2Vec / DeepWalk	Built-in	Yes	Yes	No	Yes	Yes	Yes	Yes	Yes	Yes	No	No	No	No	No	No	No
Built-in classifiers (no PyTorch)	MLP + Label Propagation	Requires PyTorch	No	No	Requires PyTorch	No	Requires TF	No	No	No	No	No	No	Requires TF	No	No	No
Graph generators	5	Some	No	No	Some	No	No	No	No	No	No	No	No	No	No	No	No
Graph sampling	6 methods	Yes	No	No	Yes	No	Yes	No	Yes	No	No	No	No	Yes	No	No	No
Hyperparameter tuning	Grid + Random	Manual	No	No	Manual	No	Manual	No	No	No	No	No	No	No	No	No	No
Install size	~5 MB	~500 MB+	~15 MB	~3 MB	~400 MB+	~2 MB	~600 MB+	~50 MB	~200 MB+	~5 MB	~5 MB	~300 MB+	~50 MB	~500 MB+	~5 MB	~5 MB	~10 MB
Multi-GPU support	Not Needed	Yes	No	No	Yes	No	Limited	No	Yes	No	No	No	No	No	No	No	No
Actively maintained	Yes	Yes	Yes	Minimal	Yes	Yes	Archived	Inactive	Inactive	Inactive	Inactive	Inactive	Yes	Inactive	Inactive	Inactive	Inactive

Feature comparison only. Performance benchmarks are on the Benchmarks page (5 real-world datasets from SNAP, Planetoid & DGL + 1 scale test).

Quick Start

From Edges to Embeddings in 5 Lines

from pycleora import SparseMatrix, embed, find_most_similar

# Build graph from edge list
edges = ["alice item_laptop", "alice item_mouse", "bob item_keyboard"]
graph = SparseMatrix.from_iterator(iter(edges), "complex::reflexive::product")

# Generate 1024-dimensional embeddings
embeddings = embed(graph, feature_dim=1024)

# Find similar entities
similar = find_most_similar(graph, embeddings, "alice", top_k=5)
for r in similar:
    print(f"{r['entity_id']}: {r['similarity']:.4f}")
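A similarity lookup like the one above is typically a cosine-similarity ranking over the embedding vectors. The sketch below shows that computation on toy vectors in plain Python. It is an assumption about the general technique, not the implementation behind find_most_similar, and `top_k_similar` is a hypothetical helper.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k_similar(query_id, vectors, k=5):
    """Rank every other entity by cosine similarity to the query entity."""
    query = vectors[query_id]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != query_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-dimensional vectors standing in for real embeddings.
toy_vectors = {
    "alice": [1.0, 0.0],
    "bob": [0.0, 1.0],
    "item_laptop": [0.9, 0.1],
    "item_keyboard": [0.1, 0.9],
}
nearest = top_k_similar("alice", toy_vectors, k=2)
for entity_id, similarity in nearest:
    print(f"{entity_id}: {similarity:.4f}")
```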

Ready to Embed Your Graph?

Join Zomato, Dailymotion, Synerise, and ML teams worldwide using Cleora in production. Install in seconds, embed in minutes.


pip install pycleora

All Random Walks. One Matrix Multiply.
