SNB Interactive Part 2: Modeling Choices

by Orri Erling / on 26 May 2015
Note: this post is a continuation of "SNB Interactive Part 1: What is SNB Interactive Really About?" post by Orri Erling.

​SNB Interactive is the wild frontier, with very few rules. This is necessary, among other reasons, because there is no standard property graph data model, and because the contestants support a broad mix of programming models, ranging from in-process APIs to declarative query.

In the case of Virtuoso, we have played with SQL and SPARQL implementations. For a fixed schema and well known workload, SQL will always win. The reason for this is that this allows to materialize multi-part indices and data orderings that make sense for the application. In other words, there is transparency into physical design. An RDF application may also have physical design by means ofstructure-aware storage but this is more complex and here we are just concerned with speed and having things work precisely as we intend.

Schema Design

SNB has a regular schema described by a UML diagram. This has a number of relationships of which some have attributes. There are no heterogenous sets, e.g. no need for run-time typed attributes or graph edges with the same label but heterogeneous end points. Translation into SQL or RDF is straightforward. Edges with attributes, e.g. the knows relation between people would end up represented as a subject with the end points and the date since as properties. The relational implementation has a two-part primary key and the date since as a dependent column. A native property graph database would use an edge with an extra property for this, as such are typically supported.

The only table-level choice has to do with whether posts and comments are kept in the same or different data structures. The Virtuoso schema has a single table for both, with nullable columns for the properties that occur only in one. This makes the queries more concise. There are cases where only non-reply posts of a given author are accessed. This is supported by having two author foreign key columns each with its own index. There is a single nullable foreign key from the reply to the post/comment being replied to.

The workload has some frequent access paths that need to be supported by index. Some queries reward placing extra columns in indices. For example, a common pattern is accessing the most recent posts of an author or group of authors. There, having a composite key of ps_creatorid, ps_creationdate, ps_postid pays off since the top-k on creationdate can be pushed down into the index without needing a reference to the table.

The implementation is free to choose data types for attributes, specifically datetimes. The Virtuoso implementation adopts the practice of the Sparksee and Neo4J implementations and represents this is a count of milliseconds since epoch. This is less confusing, faster to compare and more compact than a native datetime datatype that may or may not have timezones etc. Using a built-in datetime seems to be nearly always a bad idea. A dimension table or a number for a time dimension avoids the ambiguities of a calendar or at least makes these explicit.

The benchmark allows procedurally maintaining materializations of intermediate results for use by queries as long as these are maintained transaction by transaction. For example, each person could have the 20 newest posts by immediate contacts precomputed. This would reduce Q2 “top of the wall” to a single lookup. This dows not however appear to be worthwhile. The Virtuoso implementation does do one such materialization for Q14: A connection weight is calculated for every pair of persons that know each other. This is related to the count of replies by one or the other to content generated by the other. If there does not exist a single reply in either direction, the weight is taken to be 0. This weight is precomputed after bulk load and subsequently maintained each time a reply is added. The table for this is the only row-wise structure in the schema and represents a half matrix of connected people, i.e. person1, person2 -> weight. Person1 is by convention the one with the smaller p_personid. Note that comparing id’s in this way is useful but not normally supported by RDF systems. RDF would end up comparing strings of URI’s with disastrous performance implications unless an implementation specific trick were used.

In the next installment we will analyze an actual run.

SNB Interactive Series