SNB Interactive is the wild
frontier, with very few rules. This is necessary, among other reasons,
because there is no standard property graph data model, and because the
contestants support a broad mix of programming models, ranging from
in-process APIs to declarative query.
In the case of Virtuoso, we have experimented with both SQL and SPARQL implementations. For a fixed schema and a well-known workload, SQL will always win. The reason is that SQL allows materializing multi-part indices and data orderings that make sense for the application. In other words, there is transparency into physical design. An RDF application may also have physical design by means of structure-aware storage, but this is more complex, and here we are concerned only with speed and with having things work precisely as we intend.
SNB has a regular schema comprising a number of relationships, of which some have attributes.
There are no heterogeneous sets, e.g. no need for run-time typed
attributes or graph edges with the same label but heterogeneous end
points. Translation into SQL or RDF is straightforward. Edges with
attributes, e.g. the knows relation between people, would end up
represented in RDF as a subject with the end points and the date since as
properties. The relational implementation has a two-part primary key and
the date since as a dependent column. A native property graph database
would use an edge with an extra property for this, as such systems typically
support attributes directly on edges.
The only table-level choice has to do with whether posts and
comments are kept in the same or different data structures. The Virtuoso
schema has a single table for both, with nullable columns for the
properties that occur only in one. This makes the queries more concise.
There are cases where only non-reply posts of a given author are
accessed. This is supported by having two author foreign key columns
each with its own index. There is a single nullable foreign key from the
reply to the post/comment being replied to.
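A minimal sketch of this single-table layout, here in SQLite with illustrative column names (not the actual Virtuoso SNB schema):

```python
import sqlite3

# Illustrative single post/comment table; column names are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE post (
  ps_postid        INTEGER PRIMARY KEY,
  ps_creatorid     INTEGER,   -- author, set for posts and replies alike
  ps_p_creatorid   INTEGER,   -- author, set only for non-reply posts
  ps_replyof       INTEGER REFERENCES post (ps_postid),  -- NULL for non-replies
  ps_creationdate  INTEGER,   -- milliseconds since epoch
  ps_content       TEXT,      -- nullable: post-only and comment-only columns coexist
  ps_imagefile     TEXT       -- nullable: occurs only on non-reply posts
);
-- One author index per access path: all content vs. non-reply posts only.
CREATE INDEX post_creator   ON post (ps_creatorid, ps_creationdate);
CREATE INDEX post_p_creator ON post (ps_p_creatorid, ps_creationdate);
""")
```

With this layout, a query for only the non-reply posts of an author goes through `ps_p_creatorid` and its index, without filtering out replies at run time.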
The workload has some frequent access paths that need to be supported by
index. Some queries reward placing extra columns in indices.
For example, a common pattern is accessing the most recent posts of
an author or group of authors. There, having a composite key that
includes the creation date along with ps_postid pays off, since the
top-k selection on creationdate can be pushed down into the index without needing
a reference to the table.
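As an illustration (again SQLite, with hypothetical column names), a composite index on author, creation date, and post id lets a "newest posts of these authors" query be answered from the index alone:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE post (ps_postid INTEGER, ps_creatorid INTEGER, "
            "ps_creationdate INTEGER)")
con.execute("CREATE INDEX post_creator "
            "ON post (ps_creatorid, ps_creationdate, ps_postid)")
con.executemany("INSERT INTO post VALUES (?, ?, ?)",
                [(i, i % 5, 1_000_000 + i) for i in range(1000)])
# Every referenced column is in the index, so this top-k scan needs
# no reference back to the base table (a covering-index access).
rows = con.execute("""
    SELECT ps_postid, ps_creationdate
    FROM post
    WHERE ps_creatorid IN (1, 2)
    ORDER BY ps_creationdate DESC
    LIMIT 20""").fetchall()
```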
The implementation is free to choose data types for
attributes, specifically datetimes. The Virtuoso implementation adopts
the practice of the Neo4J implementations and represents these as a
count of milliseconds since the epoch. This is less confusing,
faster to compare, and more compact than a native datetime data type that
may or may not have timezones, etc. Using a built-in datetime seems to be
nearly always a bad idea. A dimension table or a number for a time
dimension avoids the ambiguities of a calendar, or at least makes these
explicit.
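A sketch of the encoding (the helper name is illustrative):

```python
from datetime import datetime, timezone

def to_millis(dt: datetime) -> int:
    """Encode a datetime as milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assume naive datetimes are UTC
    return int(dt.timestamp() * 1000)

a = to_millis(datetime(2012, 1, 1, tzinfo=timezone.utc))
b = to_millis(datetime(2012, 1, 2, tzinfo=timezone.utc))
# Plain integer arithmetic replaces calendar- and timezone-aware logic.
assert b - a == 86_400_000  # exactly one day, in milliseconds
```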
The benchmark allows procedurally maintaining materializations of intermediate results for use by queries, as long as these are maintained transaction by transaction. For example, each person could have the 20 newest posts by immediate contacts precomputed. This would reduce Q2 ("top of the wall") to a single lookup. This does not, however, appear to be worthwhile. The Virtuoso implementation does do one such materialization, for Q14: a connection weight is calculated for every pair of persons that know each other. The weight is derived from the count of replies by either person to content generated by the other. If there is not a single reply in either direction, the weight is taken to be 0. This weight is precomputed after bulk load and subsequently maintained each time a reply is added. The table for this is the only row-wise structure in the schema and represents a half matrix of connected people, i.e.
Person1 is by convention the one with the smaller
p_personid. Note that comparing ids in this way is useful but not normally supported by RDF systems. RDF would end up comparing the strings of URIs, with disastrous performance implications, unless an implementation-specific trick were used.
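A minimal sketch of the half-matrix maintenance, with a plain reply count standing in for the actual weight formula (names and the increment are illustrative):

```python
# Half matrix of connection weights: one entry per connected pair,
# keyed with the smaller person id first, as in the text.
weights: dict[tuple[int, int], int] = {}

def pair(p1: int, p2: int) -> tuple[int, int]:
    """Canonical half-matrix key: the smaller id comes first."""
    return (p1, p2) if p1 < p2 else (p2, p1)

def on_reply(author: int, replied_to_author: int) -> None:
    """Incremental maintenance: bump the pair's weight on each new reply."""
    if author != replied_to_author:
        key = pair(author, replied_to_author)
        weights[key] = weights.get(key, 0) + 1

on_reply(42, 7)  # 42 replies to content by 7
on_reply(7, 42)  # 7 replies to content by 42
# Replies in both directions accumulate into the same half-matrix entry.
```

A pair with no replies simply has no row, which is read as weight 0, matching the convention above.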
In the next installment we will analyze an actual run.