Benchmark Results
We benchmarked 8 algorithms across 5 real-world datasets plus 1 scale test. Every dataset is a genuine academic benchmark from SNAP, Planetoid, or DGL. Cleora wins on accuracy on every single dataset while using 10–24x less memory than accuracy-competitive methods. Dense and random-walk methods fail catastrophically on larger graphs.
Summary Table
Best accuracy per dataset. Green = best on that dataset. "OOM" = out of memory. "Timed Out" = exceeded time budget.
ego-Facebook (4,039 nodes, 18 Louvain communities)
Facebook ego network from SNAP (~4K nodes, ~88K edges). Community labels detected via Louvain. All 8 algorithms benchmarked with 256-dimensional embeddings:
| Algorithm | Accuracy | Macro F1 | Time | Memory |
|---|---|---|---|---|
| Cleora | 0.990 | 0.989 | 1.23s | 22 MB |
| DeepWalk | 0.958 | 0.956 | 59.2s | 572 MB |
| Node2Vec | 0.958 | 0.956 | 67.9s | 572 MB |
| NetMF | 0.957 | 0.958 | 28.8s | 1,098 MB |
| HOPE | 0.890 | 0.905 | 31.5s | 857 MB |
| RandNE | 0.212 | 0.181 | 0.07s | 42 MB |
| ProNE | 0.075 | 0.056 | 0.26s | 67 MB |
| GraRep | Timed Out — dense SVD per k-step exceeds 90s budget | |||
Cora (2,708 nodes, 7 classes)
Citation network from Planetoid (Yang et al., 2016). 2,708 ML papers across 7 subject areas, 10,858 citation edges:
| Algorithm | Accuracy | Macro F1 | Time | Memory |
|---|---|---|---|---|
| Cleora | 0.861 | 0.858 | 1.03s | 14 MB |
| NetMF | 0.839 | 0.836 | 4.23s | 332 MB |
| DeepWalk | 0.835 | 0.833 | 24.1s | 227 MB |
| Node2Vec | 0.835 | 0.833 | 25.8s | 227 MB |
| HOPE | 0.821 | 0.818 | 15.97s | 330 MB |
| GraRep | 0.809 | 0.806 | 16.4s | 322 MB |
| RandNE | 0.247 | 0.246 | 0.03s | 24 MB |
| ProNE | 0.179 | 0.178 | 0.13s | 40 MB |
CiteSeer (3,312 nodes, 6 classes)
Citation network from Planetoid. 3,312 CS papers across 6 subject areas, 9,464 citation edges:
| Algorithm | Accuracy | Macro F1 | Time | Memory |
|---|---|---|---|---|
| Cleora | 0.824 | 0.822 | 0.99s | 16 MB |
| NetMF | 0.810 | 0.810 | 6.58s | 335 MB |
| DeepWalk | 0.806 | 0.806 | 29.3s | 294 MB |
| Node2Vec | 0.806 | 0.806 | 29.6s | 294 MB |
| GraRep | 0.756 | 0.756 | 27.3s | 411 MB |
| HOPE | 0.740 | 0.740 | 19.6s | 430 MB |
| RandNE | 0.244 | 0.244 | 0.02s | 27 MB |
| ProNE | 0.189 | 0.188 | 0.14s | 45 MB |
PubMed (19,717 nodes, 3 classes)
Citation network from Planetoid. 19,717 diabetes papers across 3 categories, 88,676 citation edges:
| Algorithm | Accuracy | Macro F1 | Time | Memory |
|---|---|---|---|---|
| Cleora | 0.879 | 0.878 | 1.40s | 97 MB |
| RandNE | 0.351 | 0.351 | 0.22s | 175 MB |
| ProNE | 0.339 | 0.339 | 0.75s | 291 MB |
| HOPE | Timed Out — sparse inverse too slow at 19.7K nodes | |||
| NetMF | OOM — requires O(n²) dense matrix (19.7K² × 8 bytes ≈ 3.1 GB) | |||
| GraRep | OOM — dense SVD per k-step | |||
| DeepWalk | Timed Out — random walks too slow at this scale | |||
| Node2Vec | Timed Out — random walks too slow at this scale | |||
PPI (3,890 nodes, 50 classes)
Protein-protein interaction graph. 3,890 proteins, 76,584 edges, 50 functional classes:
| Algorithm | Accuracy | Macro F1 | Time | Memory |
|---|---|---|---|---|
| Cleora | 1.000 | 1.000 | 1.23s | 21 MB |
| RandNE | 0.073 | 0.070 | 0.07s | 40 MB |
| ProNE | 0.023 | 0.021 | 1.45s | 64 MB |
| HOPE | Timed Out | |||
| NetMF | OOM | |||
| GraRep | OOM | |||
| DeepWalk | Timed Out | |||
| Node2Vec | Timed Out | |||
roadNet-CA (1,965,206 nodes, speed/memory only)
California road network from SNAP (~2M nodes, ~5.5M edges). No ground-truth community labels, so only speed and memory metrics are reported:
| Algorithm | Time | Memory |
|---|---|---|
| Cleora | 31.5s | 4,129 MB |
| RandNE | OOM | |
| ProNE | OOM | |
| HOPE | OOM | |
| NetMF | OOM | |
| GraRep | OOM | |
| DeepWalk | OOM | |
| Node2Vec | OOM | |
Speed Comparison
Embedding time across real datasets (256 dim). Dashed bars indicate algorithms that failed (OOM or Timed Out):
Memory Usage
Peak memory footprint per algorithm across real datasets (256 dim). Dashed bars indicate OOM failures:
Accuracy vs Speed Tradeoff
Scatter plot showing how each algorithm trades off accuracy against embedding time across all real datasets with labels (256 dim):
When to Use What
Methodology
Datasets
All datasets are downloaded from their canonical academic sources at runtime. No synthetic data is used.
- ego-Facebook (4,039 nodes, 88,234 edges) — from SNAP. Labels: Louvain community detection (18 communities, seed=42).
- Cora (2,708 nodes, 5,429 edges, 7 classes) — from Planetoid (Yang et al., ICML 2016). Labels: paper category.
- CiteSeer (3,312 nodes, 4,732 edges, 6 classes) — from Planetoid. Labels: paper category.
- PubMed (19,717 nodes, 44,338 edges, 3 classes) — from Planetoid. Labels: paper category.
- PPI (3,890 nodes, 76,584 edges, 50 classes) — from DGL (Zitnik & Leskovec, Bioinformatics 2017). Labels: protein functional class.
- roadNet-CA (1,965,206 nodes, 5,533,214 edges) — from SNAP. Scale test only, no classification labels.
Input Representation
All algorithms operate on pure graph topology — the adjacency structure only. No node features, no attribute vectors, no side information. Every algorithm receives the identical edge list and nothing else. This tests each method's ability to extract structure from connectivity alone.
Algorithms & Parameters
- Eight algorithms compared: Cleora, ProNE, RandNE, HOPE, NetMF, GraRep, DeepWalk, and Node2Vec.
- All algorithms produce 256-dimensional embeddings.
- Cleora uses 40 iterations with left Markov normalization and interleaved whitening.
- All competing algorithms use their default hyperparameters with no dataset-specific tuning.
Classifier & Evaluation Protocol
Every algorithm is evaluated with the identical classifier on the identical data split. No algorithm receives a different or more powerful evaluator.
- Classifier: Nearest Centroid with cosine similarity. For each class, the centroid is the L2-normalized mean of training embeddings. Test nodes are assigned to the class of the nearest centroid. This is not an MLP, not logistic regression, and not a learned classifier — it is a parameter-free geometric assignment.
- Train/test split: 80/20 random split, deterministic seed=42, applied identically to all algorithms.
- Metrics: Accuracy and Macro F1 (unweighted average across all classes).
- Runs: Single deterministic run per algorithm per dataset (seed=42). Deterministic seeding ensures exact reproducibility.
Hardware & Resource Constraints
- All benchmarks run on a single shared vCPU core with approximately 3 GB RAM in a constrained Replit cloud environment.
- Timing: wall-clock time from graph load to embedding output.
- Memory: peak allocation tracked via Python's
tracemalloc. - Timeout: 90 seconds. Any algorithm exceeding this is marked "Timed Out."
- OOM: process killed by the OS when exceeding available memory.
- Competitors that fail here may succeed on larger hardware — but that is the point: Cleora delivers state-of-the-art results under extreme resource constraints where other algorithms cannot even complete.
A perfect score on a 50-class classification task is extraordinary — and that is exactly the point. Classical diffusion-based embeddings (DeepWalk, Node2Vec) suffer from oversmoothing: after many propagation steps, all node representations collapse toward the graph's leading eigenvector, destroying community structure. Cleora's interleaved whitening renormalizes the spectrum at every iteration, preserving and amplifying the separation between communities instead of collapsing it. After 40 iterations, Cleora's embeddings carry such clean community signal that even the simplest possible classifier — Nearest Centroid — achieves perfect separation. This is not overfitting. This is not a special classifier. This is what happens when the embedding algorithm is mathematically designed to preserve exactly the structure that classification requires. Read the full mathematical derivation →
Every result on this page is reproducible in a few lines of Python. Single core, no GPU, under 60 seconds:
pip install pycleora
python -c "
import pycleora
from pycleora import SparseMatrix, datasets, metrics
ds = datasets.load_dataset('cora')
graph = SparseMatrix.from_iterator(iter(ds['edges']), ds['columns'])
emb = pycleora.embed(graph)
print(metrics.node_classification_scores(graph, emb, ds['labels']))
"
On machines with more RAM, algorithms that timed out or ran out of memory here may complete — but Cleora does not need more resources to win.