How Cleora Works
A technical deep-dive into Cleora's architecture — from mathematical foundations to implementation details.
Architecture Overview
Input: Edge List
TSV/space-separated edge file. Each line defines a hyperedge connecting entities. Column types control how entities interact.
alice item_laptop
bob item_keyboard
alice item_mouse
Sparse Transition Matrix
Construct a sparse transition matrix M where M[i][j] is the probability (or weight) of moving from node i to node j. Left-Markov normalization (D^{-1}A) makes the matrix row-stochastic: each row sums to 1. Symmetric normalization uses D^{-1/2} A D^{-1/2} instead.
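As a sketch of the left-Markov construction (scipy here, not Cleora's Rust core; the graph and edge list are hypothetical):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 4-node undirected graph given as (src, dst) pairs.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

# Symmetrize: store each undirected edge in both directions.
rows, cols = zip(*(edges + [(dst, src) for src, dst in edges]))
A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))

# Left-Markov normalization: divide each row by its degree so rows sum to 1.
deg = np.asarray(A.sum(axis=1)).ravel()
M = csr_matrix(A.multiply(1.0 / deg[:, None]))

assert np.allclose(np.asarray(M.sum(axis=1)).ravel(), 1.0)
```

Node 0 has two neighbors, so each of its outgoing transitions gets probability 0.5.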
Deterministic Initialization
Hash-based initialization: each entity gets a deterministic d-dimensional vector derived from its string ID. No random seeds. Same entity always gets the same initial vector.
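The real initialization is SipHash-based in the Rust core; a minimal Python sketch of the same idea (stable hash of the string ID seeds each vector — blake2b is used here purely for illustration, not Cleora's actual bit layout):

```python
import hashlib
import numpy as np

def hash_init(entity_ids, dim):
    """Deterministic init: each entity's vector is derived from its string ID."""
    vectors = np.empty((len(entity_ids), dim), dtype=np.float32)
    for i, eid in enumerate(entity_ids):
        # Stable 64-bit seed from the ID bytes; same ID -> same seed, always.
        seed = int.from_bytes(
            hashlib.blake2b(eid.encode(), digest_size=8).digest(), "big"
        )
        rng = np.random.default_rng(seed)
        vectors[i] = rng.standard_normal(dim)
    return vectors

a = hash_init(["alice", "bob"], 8)
b = hash_init(["alice", "bob"], 8)
assert np.array_equal(a, b)  # fully reproducible across runs
```

No global seed is involved: reproducibility is a property of each entity's ID, so adding entities never changes existing vectors.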
Iterative Propagation
Repeat k times: E_{t+1} = normalize(M × E_t). Each multiplication aggregates every possible random walk of that length. After k iterations, each node's embedding encodes its k-hop neighborhood structure.
Output: Embedding Matrix
N × d matrix of L2-normalized embeddings. Each row is a unit vector. Cosine similarity between any two rows measures structural proximity in the original graph.
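Because every row is a unit vector, cosine similarity reduces to a plain dot product; a small illustration (random embeddings, assumed shapes):

```python
import numpy as np

# Stand-in for the output matrix: 5 nodes, 16 dims, rows L2-normalized.
E = np.random.default_rng(0).standard_normal((5, 16))
E /= np.linalg.norm(E, axis=1, keepdims=True)

# For unit rows, the Gram matrix E @ E.T is exactly the cosine-similarity matrix.
sims = E @ E.T
assert np.allclose(np.diag(sims), 1.0)  # every row is maximally similar to itself
```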
Core Algorithm
Cleora's algorithm is a form of spectral embedding. The key insight: a single sparse matrix multiplication M × E computes the weighted average of all neighbors' embeddings for every node simultaneously. This is mathematically equivalent to aggregating all possible random walks of length 1.
After k multiplications, the embedding of node v encodes:
- All paths of length ≤ k starting from v
- The probability of reaching every other node in k steps
- The structural role of v in its k-hop neighborhood
The L2 normalization after each iteration prevents embedding collapse and ensures the cosine similarity metric is meaningful.
E_0 = hash_init(entity_ids, dim)
E_{t+1} = L2_normalize(M × E_t)   # repeat for t = 0 … k-1
output = E_k
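The pseudocode above maps onto a few lines of numpy/scipy (a runnable sketch, not the Rust implementation; the tiny path graph is hypothetical):

```python
import numpy as np
from scipy.sparse import csr_matrix

def cleora_embed(M, E0, k):
    """Iterate E_{t+1} = L2_normalize(M @ E_t) for k steps."""
    E = E0
    for _ in range(k):
        E = M @ E                                         # aggregate neighbors
        E /= np.linalg.norm(E, axis=1, keepdims=True)     # row-wise L2 norm
    return E

# Tiny 3-node path graph, left-Markov normalized.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
M = csr_matrix(A / A.sum(axis=1, keepdims=True))
E0 = np.random.default_rng(42).standard_normal((3, 4))
E = cleora_embed(M, E0, k=3)
assert np.allclose(np.linalg.norm(E, axis=1), 1.0)
```

Each pass replaces a node's vector with the normalized average of its neighbors' vectors, which is the "all walks of length 1 at once" operation described above.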
Markov Propagation Types
Left Markov (default)
Row-stochastic: D^{-1}A. Each row sums to 1. Equivalent to a random-walk transition matrix. Better for co-occurrence/similarity tasks, where the intuition is that nodes visited by the same random walker should be similar.
Use for: recommendations, similarity search, clustering
Symmetric Markov
D^{-1/2} A D^{-1/2}. Preserves spectral properties: the eigenvectors of this matrix form the graph's spectral embedding. Better for tasks requiring structural equivalence.
Use for: node classification, community detection, role discovery
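The two variants differ only in how the degree matrix is applied; a scipy sketch on a hypothetical 3-node adjacency matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

A = csr_matrix(np.array([[0, 1, 1],
                         [1, 0, 0],
                         [1, 0, 0]], dtype=float))
deg = np.asarray(A.sum(axis=1)).ravel()

M_left = diags(1.0 / deg) @ A                 # D^{-1} A : rows sum to 1
d_inv_sqrt = diags(1.0 / np.sqrt(deg))
M_sym = d_inv_sqrt @ A @ d_inv_sqrt           # D^{-1/2} A D^{-1/2} : symmetric

assert np.allclose(np.asarray(M_left.sum(axis=1)).ravel(), 1.0)
assert np.allclose(M_sym.toarray(), M_sym.T.toarray())
```

The symmetry of D^{-1/2} A D^{-1/2} is what makes its eigendecomposition well-behaved, while D^{-1}A trades that for the random-walk interpretation.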
Convergence Properties
Cleora converges rapidly due to L2 normalization constraining the embedding manifold. Practical guidelines:
| Iterations | Neighborhood | Best For |
|---|---|---|
| 2-3 | Local (2-3 hops) | Direct neighbors, immediate similarity |
| 4-5 | Medium (4-5 hops) | Community-level similarity, recommendations |
| 7+ | Global | Contextual similarity (like skip-gram), role equivalence |
Without whitening, embeddings converge to the principal eigenvector beyond ~10 iterations. With interleaved whitening (the default), the spectrum is renormalized at every step, preventing collapse and allowing deep runs (e.g., 40 iterations) to extract structural signal that shallow runs miss.
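The whitening transform itself is standard; here is a minimal ZCA-whitening sketch in numpy (illustrative only — the decomposition Cleora's Rust core actually uses may differ):

```python
import numpy as np

def zca_whiten(E):
    """ZCA-whiten rows of E: zero mean, identity covariance across dimensions."""
    Ec = E - E.mean(axis=0)
    U, S, Vt = np.linalg.svd(Ec, full_matrices=False)
    # U has orthonormal columns; rotating back by Vt keeps the result close
    # to the input (the "ZCA" property) while equalizing variance.
    return U @ Vt * np.sqrt(E.shape[0])

rng = np.random.default_rng(0)
# Anisotropic input: one direction dominates, mimicking spectral collapse.
E = rng.standard_normal((200, 8)) * np.array([1, 5, 0.1, 2, 1, 1, 1, 1])
W = zca_whiten(E)
cov = W.T @ W / W.shape[0]
assert np.allclose(cov, np.eye(8), atol=1e-6)  # all directions equalized
```

Interleaving a step like this between propagations stops any single eigendirection from swallowing the embedding space.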
Rust Core (PyO3)
The performance-critical operations are implemented in Rust and exposed to Python via PyO3:
SparseMatrix (Rust)
- from_iterator — parses edge strings, builds adjacency in a single pass. Multithreaded via Rayon for large inputs.
- initialize_deterministically — SipHash-based vector initialization. No randomness, no seeds.
- left_markov_propagate / symmetric_markov_propagate — sparse matrix-vector multiplication with adaptive thread pool. Processes rows in parallel chunks sized to L2 cache.
- to_sparse_csr — exports the Markov matrix to scipy-compatible CSR format for advanced operations in Python.
The Rust layer handles ~95% of compute time. Python orchestrates iteration and normalization. This hybrid approach gives Rust speed with Python flexibility.
Python API Layer
The Python layer provides high-level APIs on top of the Rust core:
pycleora (core)
embed(), embed_multiscale(), embed_with_attention(), supervised_refine(), CleoraEmbedder
pycleora.algorithms
ProNE, RandNE, HOPE, NetMF, GraRep, DeepWalk, Node2Vec
pycleora.metrics
AUC, MRR, Hits@K, MAP@K, nDCG, F1, ARI, Silhouette
pycleora.sampling
neighborhood, subgraph, GraphSAINT, negative, train/test splits
pycleora.viz
t-SNE, UMAP, PCA reduction + matplotlib plotting
pycleora.io_utils
NetworkX, PyG, DGL export. NPZ/CSV/TSV/Parquet save/load
Memory Model
Cleora's memory usage is dominated by two structures:
- Sparse matrix — O(edges) memory. CSR format stores only non-zero values. A graph with 5M edges uses ~60 MB for the matrix.
- Embedding matrix — O(nodes × dim) memory. 2M nodes × 1024 dims × 4 bytes = ~8 GB.
Total for roadNet-CA (2M nodes, 5.5M edges, dim=1024): ~9 GB. Compare: NetMF requires O(nodes²) for the dense log matrix — 2M² × 8 bytes = 32 TB. This is why NetMF crashes on large graphs.
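The figures above follow directly from the structure sizes; back-of-envelope arithmetic (treating each listed edge as one stored non-zero, as the text does):

```python
# Back-of-envelope memory estimates: float32 values, int32 indices.
nodes, edges, dim = 2_000_000, 5_500_000, 1024

# CSR stores one value + one column index per non-zero, plus n+1 row pointers.
csr_bytes = edges * (4 + 4) + (nodes + 1) * 4       # ~52 MB

emb_bytes = nodes * dim * 4                          # ~8.2 GB dense N x d matrix

dense_netmf_bytes = nodes ** 2 * 8                   # 32 TB dense float64 matrix
```

The gap between O(edges) and O(nodes²) is the whole story: the sparse matrix is a rounding error next to the embedding matrix, while a dense N × N matrix is infeasible at this scale.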
Why Not Random Walks?
Traditional methods (DeepWalk, Node2Vec) sample random walks, then train a skip-gram model on the walk sequences. This has several fundamental limitations:
Random Walk Methods
- Sample a small subset of possible walks
- Require negative sampling (approximation)
- Non-deterministic: different runs give different results
- Need many walks per node for convergence
- Skip-gram training adds O(walks × walk_length × dim) time
Cleora (Matrix Propagation)
- Computes ALL walks simultaneously via matrix multiplication
- No sampling, no approximation
- Fully deterministic: same input = same output
- Converges in 3-5 iterations
- Time complexity: O(iterations × edges × dim)
The result: Cleora is typically 100-240x faster than random walk methods on the same graph, with equal or better accuracy on node classification benchmarks.