This post is the first in a series of blog posts analyzing the LDBC Social Network Benchmark Interactive workload. It is written from the dual perspective of participating in the benchmark design and of building the OpenLink Virtuoso implementation of the same.
With two implementations of SNB Interactive at four different scales, we can take a first look at what the benchmark is really about. The hallmark of a benchmark implementation is that its performance characteristics are understood: even if these do not represent the maximum attainable, there are no glaring mistakes, and the implementation represents a reasonable best effort by those who ought to know, namely the system vendors.
The essence of a benchmark is a set of trick questions or choke points, as LDBC calls them. A number of these were planned from the start. It is then the role of experience to tell whether addressing these is really the key to winning the race. Unforeseen ones will also surface.
So far, we see that SNB confronts the implementor with choices in the following areas:
- Data model: Relational, RDF, or property graph?
- Physical model, e.g. row-wise vs. column-wise storage
- Materialized data ordering: Sorted projections, composite keys, replicating columns in auxiliary data structures
- Maintaining precomputed, materialized intermediate results, e.g. use of materialized views, triggers
- Query optimization: join order/type, interesting physical data orderings, late projection, top k, etc.
- Parameters vs. literals: Sometimes different parameter values result in different optimal query plans
- Predictable, uniform latency: The measurement rules stipulate that the SUT must not fall behind the simulated workload
- Durability: How to make data durable while maintaining steady throughput? Logging vs. checkpointing.
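To illustrate the late-projection and top-k points above, here is a minimal Python sketch (the data, names, and layout are invented for illustration; a real engine does this inside its storage layer). The idea is to scan only a narrow (id, date) projection, as a column-wise store can, find the top k by date, and fetch the wide rows for just those k winners.

```python
import heapq

# Hypothetical narrow projection: (post_id, creation_date) pairs,
# as a column-wise store might scan them without touching other columns.
posts_narrow = [(i, 1_000_000 - i * 37) for i in range(10_000)]

# Wide rows keyed by id; materializing these is the expensive part we defer.
posts_wide = {i: {"id": i, "content": f"post {i}", "author": i % 100}
              for i in range(10_000)}

def top_k_late_projection(k):
    # Top-k over the narrow projection only: O(n log k), no wide-row access.
    top = heapq.nlargest(k, posts_narrow, key=lambda t: t[1])
    # Late projection: fetch full rows for just the k winners.
    return [posts_wide[post_id] for post_id, _ in top]

rows = top_k_late_projection(10)
```

The alternative, projecting all columns before sorting, would move every wide row through the sort; deferring the projection until after the top-k cut is what makes the interesting data ordering pay off.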
In the process of making a benchmark implementation, one naturally encounters questions about the validity, reasonability and rationale of the benchmark definition itself. Additionally, even though the benchmark might not directly measure certain aspects of a system, making an implementation will take a system past its usual envelope and highlight some operational aspects.
- Data generation: Generating a mid-size dataset takes time, e.g. 8 hours for 300 GB. In a cloud situation, keeping the dataset in S3 or similar is necessary; re-generating it every time is not an option.
- Query mix - Are the relative frequencies of the operations reasonable? What bias does this introduce?
- Uniformity of parameters: Due to non-uniform data distributions in the dataset, there is easily a 100x difference between a ‘fast’ and ‘slow’ case of a single query template. How long does one need to run to balance these fluctuations?
- Working set: Experience shows that there is a large difference between an almost-warm working set and true steady state. This can be a factor of 1.5 in throughput.
- Are the latency constraints reasonable? In the present case, a qualifying run must have under 5% of all query executions starting over 1 second late. Each execution is scheduled beforehand and done at the intended time. If the SUT does not keep up, it will have all available threads busy and must finish some work before accepting new work, so some queries will start late. Is this a good criterion for measuring consistency of response time? There are some obvious possibilities of abuse.
- Is the benchmark easy to implement/run? Perfection is open-ended and optimization possibilities infinite, albeit with diminishing returns. Still, getting started should not be too hard. Since systems will be highly diverse, testing that these in fact do the same thing is important. The SNB validation suite is good for this, and given publicly available reference implementations, the effort of getting started is not unreasonable.
- Since a Qualifying run must meet latency constraints while going as fast as possible, setting the performance target involves trial and error. Does the tooling make this easy?
- Is the durability rule reasonable? Right now, one is not required to do checkpoints but must report the time to roll forward from the last checkpoint or initial state. Incenting vendors to build faster recovery is certainly good, but we are not through with all the implications. What about redundant clusters?
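The late-start criterion can be made concrete with a toy simulation (my own sketch, not the official SNB driver; the function name and the simple thread-pool model are invented). Each query has an intended start time fixed in advance; a query starts when both its intended time has arrived and a worker thread is free, and we count how many starts slip by more than one second.

```python
import heapq

def late_start_fraction(intended_starts, service_times, num_threads, threshold=1.0):
    """Simulate a pool of worker threads running a pre-scheduled query
    stream; return the fraction of executions that start more than
    `threshold` seconds late (a qualifying SNB run needs under 5%)."""
    free_at = [0.0] * num_threads  # time at which each thread becomes free
    heapq.heapify(free_at)
    late = 0
    for intended, service in zip(intended_starts, service_times):
        thread_free = heapq.heappop(free_at)
        actual = max(intended, thread_free)  # start late if no thread is free
        if actual - intended > threshold:
            late += 1
        heapq.heappush(free_at, actual + service)
    return late / len(intended_starts)

# One query per second, half-second service time: the SUT keeps up.
keeps_up = late_start_fraction([float(i) for i in range(100)], [0.5] * 100, 4)
```

The model makes the abuse potential visible: once the SUT falls behind, every subsequent start inherits the backlog, so lateness compounds rather than averaging out.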
The following posts will look at the above in light of actual experience.