The Graphalytics benchmark is an industrial-grade benchmark for graph analysis platforms such as Giraph, Spark GraphX, and GraphBLAS. It consists of six core algorithms, standard data sets, and reference outputs, enabling the objective comparison of graph analysis platforms.
The benchmark harness consists of a core component that is extended by a platform-specific driver for each platform implementation. The benchmark includes the following algorithms:
- breadth-first search (BFS)
- PageRank (PR)
- weakly connected components (WCC)
- community detection using label propagation (CDLP)
- local clustering coefficient (LCC)
- single-source shortest paths (SSSP)
These algorithms were chosen based on input from the LDBC Technical User Community (TUC) and extensive literature surveys, to ensure good coverage of scenarios. The standard data sets include both real and synthetic data sets, which are classified into intuitive “T-shirt” sizes (S, M, L, etc.).
Each experiment set in Graphalytics consists of multiple platform runs (in each run, a platform executes an algorithm on a data set), and a diverse set of experiments is carried out to evaluate different performance characteristics of the system under test.
All completed benchmarks must go through a strict validation process to ensure the integrity of the performance results.
The development of Graphalytics is supported by many active vendors in the field of large-scale graph analytics. Graphalytics already facilitates benchmarks for a large number of graph analytics platforms, such as GraphBLAS, Giraph, GraphX, and PGX.D, allowing comparison of state-of-the-art system performance across both community-driven and industry-driven platforms. To get started, see the Graphalytics documentation and software components described below.
Documents and repositories
- Benchmark specification. The source code is stored in the
- VLDB paper
- ldbc_graphalytics: Generic driver
- ldbc_graphalytics_platforms_umbra: Umbra implementation
- ldbc_graphalytics_platforms_graphblas: GraphBLAS implementation
Graphalytics competition 2023
In 2023, we will hold a new round of the Graphalytics competition. See the LDBC Graphalytics Benchmark presentation for an introduction to the benchmark framework and the competition’s rules.
- benchmark framework
- reference implementations
- data sets (data sets and expected results) are available on GitHub
Rules
- Participation is free.
- There are no monetary prizes.
- Single-node and distributed implementations are allowed.
- Partial implementations (e.g. just small to mid-sized data sets and only a few algorithms) are allowed.
- Submissions should execute each algorithm-data set combination three times. From these, the arithmetic mean of the processing times is used for ranking.
- The results of the competition will be published on the LDBC website in the form of leaderboards, which rank submissions based on performance and price-performance (adjusted for the system price).
- There is a global leaderboard that includes all algorithms and scale factors. Additionally, there is a separate leaderboard for each scale (S, M, L, XL, 2XL+), algorithm, and system category (CPU-based vs. GPU-based, single-node vs. distributed) for fine-grained comparison.
- Submissions are subject to code review and reproducibility attempts from the organizers.
- System prices should be reported following the TPC Pricing specification.
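The ranking rule above (three executions per algorithm-data set combination, ranked by the arithmetic mean of the processing times) can be illustrated with a minimal sketch. The function name `ranking_times`, the data set name `example-graph`, and the timing values are hypothetical and chosen only for illustration; this is not the official harness code.

```python
from statistics import mean

# Each algorithm-data set combination is executed three times.
REQUIRED_RUNS = 3


def ranking_times(runs):
    """Map each (algorithm, data set) pair to the arithmetic mean of its times.

    `runs` maps (algorithm, data set) -> list of processing times in seconds.
    """
    result = {}
    for combo, times in runs.items():
        if len(times) != REQUIRED_RUNS:
            raise ValueError(
                f"{combo}: expected {REQUIRED_RUNS} runs, got {len(times)}"
            )
        result[combo] = mean(times)
    return result


# Hypothetical processing times in seconds (illustrative values only).
example_runs = {
    ("BFS", "example-graph"): [1.0, 2.0, 3.0],
    ("PR", "example-graph"): [4.0, 4.0, 7.0],
}
example_means = ranking_times(example_runs)
```

Rejecting combinations with fewer or more than three runs keeps the averaged values comparable across submissions; how the official tooling handles incomplete runs is not specified here.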
Recommendations for submissions
- Submissions using modern hardware are welcome (GPUs, FPGAs, etc.).
- We encourage the use of cloud compute instances for running the benchmark (if possible).
Timeline
- March 17: Competition is announced
- April 25: Confirmation of intent
- May 1: Submissions open
- June 15: Submissions close
The Graphalytics data sets are compressed using zstd. The total size of the compressed archives is approximately 350 GB. When decompressed, the data sets require approximately 1.5 TB of disk space.
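Before downloading everything, it may help to check that the target volume has enough free space. The sketch below uses the approximate sizes quoted above. It assumes decimal units (1 GB = 10^9 bytes) and, by default, that the compressed archives are kept next to the decompressed files; both are assumptions, and the function names are hypothetical.

```python
import shutil

# Approximate sizes from the text, assuming decimal units (1 GB = 10**9 bytes).
COMPRESSED_BYTES = 350 * 10**9      # all compressed archives together
DECOMPRESSED_BYTES = 1500 * 10**9   # all data sets once decompressed


def required_bytes(keep_archives=True):
    """Estimate the disk space needed for the full collection of data sets."""
    extra = COMPRESSED_BYTES if keep_archives else 0
    return DECOMPRESSED_BYTES + extra


def has_enough_space(path, keep_archives=True):
    """Return True if the volume containing `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes(keep_archives)
```

For example, `has_enough_space(".")` checks the volume that holds the current directory. When only some sizes are downloaded, the requirement is correspondingly smaller.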
For detailed information on the data sets, see the table with their statistics.
The data sets are available in two locations:
A public Cloudflare R2 bucket
- This is the primary source for the data sets and is kept up-to-date upon changes
- The links in the table below point to this bucket
- Shell script to download the data sets from Cloudflare R2
- Download scripts for individual sizes: test graphs, sizes up to S, size M, size L, size XL, sizes 2XL+
CWI/SURFdrive data repository
- Backup repository
- This repository is kept up-to-date upon changes
- Shell script to download the data sets from SURFdrive
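Whichever location is used, a downloaded archive can be compared against a published MD5 checksum. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the expected checksum value has to be taken from the published list.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Chunked reads keep memory use small even for very large archives.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Return True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks matters here because individual archives can be far larger than available memory.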
Some of the Graphalytics data sets were fixed in March 2023. Prior to this, they were incorrectly packaged or had missing/incorrect reference outputs for certain algorithms. If you are uncertain whether you have the correct versions, cross-check them against these MD5 checksums: