SNB Driver - Part 3: Workload Execution - Putting It All Together

by Alex Averbuch / on 20 Jan 2015

Up until now we have introduced the challenges faced when executing the LDBC SNB benchmark and explained how some of them are overcome. With the foundations laid, we can now explain precisely how operations are executed.

Based on the dependencies certain operations have, and on the granularity of parallelism we wish to achieve while executing them, we assign a Dependency Mode and an Execution Mode to every operation type. Using these classifications the driver runtime then knows how each operation should be executed. These modes, as well as what they mean to the driver runtime, are described below.

Dependency Modes

While executing a workload the driver treats operations differently, depending on their Dependency Mode. In the previous section operations were categorized by whether or not they are in the sets Dependencies and/or Dependents.

Another way of communicating the same categorization is by assigning a Dependency Mode to operations - every operation type generated by a workload definition must be assigned to exactly one Dependency Mode. Dependency Modes encode dependencies, i.e., constraints on operation execution order. The driver supports a number of different Dependency Modes: None, Read Only, Write Only, Read Write. During workload execution, operations of each type are treated as follows:

• None

Depended On (NO): operations do not introduce dependencies with other operations (i.e., the correct execution of no other operation depends on these operations having completed execution)

Dependent On (NO): operation execution does not depend on GCT having advanced (i.e., correct execution of these operations does not require any other operations to have completed execution)

– Prior Execution: do nothing

– After Execution: do nothing

• Read Only

Depended On (NO): operations do not introduce dependencies with other operations (i.e., the correct execution of no other operation depends on these operations having completed execution)

Dependent On (YES): operation execution does depend on GCT having advanced sufficiently (i.e., correct execution of these operations requires that certain operations have completed execution)

– Prior Execution: wait for GCT >= operation.DepT

– After Execution: do nothing

• Write Only

Depended On (YES): operations do introduce dependencies with other operations (i.e., the correct execution of certain other operations requires these operations to have completed executing, i.e., to advance GCT)

Dependent On (NO): operation execution does not depend on GCT having advanced (i.e., correct execution of these operations does not depend on any other operations having completed execution)

– Prior Execution: add operation to Initiated Operations

– After Execution: remove operation from Initiated Operations, add operation to Completed Operations

• Read Write

Depended On (YES): operations do introduce dependencies with other operations (i.e., the correct execution of certain other operations requires these operations to have completed executing, i.e., to advance GCT)

Dependent On (YES): operation execution does depend on GCT having advanced sufficiently (i.e., correct execution of these operations requires that certain operations have completed execution)

– Prior Execution: add operation to Initiated Operations, wait for GCT >= operation.DepT

– After Execution: remove operation from Initiated Operations, add operation to Completed Operations
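The Initiated/Completed Operations bookkeeping described above can be sketched in a few lines. The following Python sketch is an illustration only, with hypothetical names (the actual driver is not written this way): it tracks which operations have been initiated and completed, and derives a completion time - the time below which no initiated-but-uncompleted operation remains.

```python
class CompletionTimeTracker:
    """Illustrative sketch of Initiated/Completed Operations bookkeeping."""

    def __init__(self):
        self.initiated = set()    # times of operations started but not yet finished
        self.max_completed = None # highest time among completed operations

    def initiate(self, t):
        # "Prior Execution: add operation to Initiated Operations"
        self.initiated.add(t)

    def complete(self, t):
        # "After Execution: remove operation from Initiated Operations,
        #  add operation to Completed Operations"
        self.initiated.remove(t)
        self.max_completed = t if self.max_completed is None else max(self.max_completed, t)

    def completion_time(self):
        # No operation at or below the completion time is still outstanding.
        # Strictly, completion time lies just below min(initiated); returning
        # min(initiated) itself keeps the sketch simple.
        if self.initiated:
            return min(self.initiated)
        return self.max_completed
```

Each driver process would maintain such a tracker locally; GCT is then the minimum of all processes' local completion times, which is what the "wait for GCT >= operation.DepT" checks above consult.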

Execution Modes

Execution Modes relate to how operations are scheduled, when they are executed, and what their failure conditions are. Each operation type in a workload definition must be assigned to exactly one Execution Mode. The driver supports a number of different Execution Modes: Asynchronous, Synchronous, Partially Synchronous. It splits the single workload operation stream into multiple streams, zero or more streams per Execution Mode. During workload execution, operations from each of these streams are treated as follows.

• Asynchronous: operations are executed individually, when their Due Time arrives.

Motivation: This is the default execution mode; it executes operations as faithfully to the workload definition as possible.

– Re-scheduling Before Execution: None: operation.DueT not modified by scheduler

– Execute When time >= operation.DueT (and GCT >= operation.DepT)

– Max Concurrent Executions: unbounded

– Max Execution Time: unbounded

– Failure: operation execution starts later than: operation.DueT + Tolerated Delay

• Synchronous: operations are executed individually, sequentially, in a blocking manner.

Motivation: Some dependencies are difficult to capture efficiently with SafeT and GCT alone. For example, social applications often support conversations via posts and likes, where likes depend on the existence of posts. Furthermore, posts and likes also depend on the existence of the users that make them. However, users are created at a lower frequency than posts and likes, and it can be assumed they do not immediately start creating content. As such, a reasonably long SafeT can be used between the creation of a user and the first time that user creates posts or likes. Conversely, posts are often replied to and/or liked soon after their creation, meaning a short SafeT would be necessary to maintain the ordering dependency. Consequently, maintaining the dependencies related to conversations would require a short SafeT, and hence a small window. This results in windows containing fewer operations, leading to less potential for parallelism within windows, less freedom in scheduling, more synchronization, and greater likelihood of bursty behavior - all negative things.

The alternative offered by Synchronous Execution is that, when practical, operations of certain types can be partitioned (e.g. posts and likes could be partitioned by the forum in which they appear), and partitions assigned to driver processes. Using the social application example from above, if all posts and likes were partitioned by forum the driver process that executes the operations from any partition could simply execute them sequentially. Then the only dependency to maintain would be on user operations, reducing synchronization dramatically, and parallelism could still be achieved as each partition would be executed independently, in parallel, by a different driver process.

– Re-scheduling Before Execution: None: operation.DueT not modified by scheduler

– Execute When time >= operation.DueT and previousOperation.completed == true (and GCT >= operation.DepT)

– Max Concurrent Executions: 1

– Max Execution Time: nextOperation.DueT - operation.DueT

– Failure: operation execution starts later than: operation.DueT + Tolerated Delay. E.g., this occurs if previousOperation did not complete in time, forcing the current operation to wait longer than the tolerated delay

• Partially Synchronous (Windowed Execution, described in more detail in Section 3.4): groups of operations from the same time window are executed together.

– Re-scheduling Before Execution: Yes, as long as the following still holds:

window.startTime <= operation.DueT < window.startTime + window.duration

Operations within a window may be scheduled in any way, as long as they remain in the window from which they originated: their Due Times, and therefore ordering, may be modified

– Execute When time >= operation.DueT (and GCT >= operation.DepT)

– Max Concurrent Executions: number of operations within window

– Max Execution Time: (window.startTime + window.duration) - operation.DueT

– Failure: operation execution starts later than: window.startTime + window.duration; or operation execution does not finish by: window.startTime + window.duration
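The "Execute When", "Failure", and "Re-scheduling" rules of the three modes reduce to a handful of predicates. The following Python sketch makes them concrete; all names are hypothetical (chosen to mirror the notation above, e.g. DueT, DepT), not the driver's actual API.

```python
from dataclasses import dataclass

@dataclass
class Op:
    due_t: float  # scheduled start time (Due Time)
    dep_t: float  # time GCT must reach before the operation may run (DepT)

def runnable_async(op, now, gct):
    # Asynchronous: run once Due Time arrives and GCT covers DepT.
    return now >= op.due_t and gct >= op.dep_t

def runnable_sync(op, now, gct, previous_completed):
    # Synchronous: as asynchronous, plus the previous operation in the
    # stream must already have completed.
    return previous_completed and now >= op.due_t and gct >= op.dep_t

def failed_start(op, start_time, tolerated_delay):
    # Failure condition shared by Asynchronous and Synchronous modes:
    # execution started later than DueT + Tolerated Delay.
    return start_time > op.due_t + tolerated_delay

def legal_reschedule(new_due_t, window_start, window_duration):
    # Partially Synchronous: a re-assigned Due Time is legal only if it
    # stays inside the window the operation originated from.
    return window_start <= new_due_t < window_start + window_duration
```

Note how the Synchronous predicate differs from the Asynchronous one only by the `previous_completed` conjunct - this is exactly why a slow operation in a Synchronous stream can push its successor past the tolerated delay.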

Tying it back to LDBC SNB

The driver was designed to execute the workload of LDBC SNB. As discussed, the main challenge of running queries in parallel on graph-shaped data stems from dependencies introduced by the graph structure. In other words, workload partitioning becomes as hard as graph partitioning.

The LDBC SNB data can in fact be seen as a union of two parts:

  1. Core Data: a relatively small and dense friendship graph (not more than 10% of the data). Updates on this part are very hard to partition among driver threads, since the graph is essentially a single dense strongly connected component.

  2. User Activity Data: posts, replies, likes; this is by far the biggest part of the data. Updates on this part are easily partitioned as long as the dependencies with the “core” part are satisfied (i.e., users don’t post things before the profiles are created, etc.).

In order to avoid friendship graph partitioning, the driver introduces the concept of SafeT: the minimal simulation time that must pass between two dependent events.

This property is enforced by the data generator, i.e. the driver does not need to change or delay operations in order to guarantee dependency safety. Respecting dependencies now means globally communicating advances of the Global Completion Time, and making sure that operations do not start earlier than SafeT after the operations they depend on.

On the other hand, the driver exploits the fact that some of the dependencies in fact do not hinder partitioning: although replies to a post can only be sent after the post is created, these kinds of dependencies are satisfied if we partition the workload by forums. This way, all (update) operations on posts and comments from one forum are assigned to one driver thread. Since there are typically many forums, each driver thread gets multiple ones. Updates from one forum are then run in Synchronous Execution Mode, and parallelism is achieved by running many distinct forums in parallel. By doing so, we can add posts and replies to forums at very high frequency without the need to communicate the GCT across driver instances (i.e. we efficiently create the so-called flash-mob effects in the posting/replying workload).
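The forum-based partitioning can be sketched as follows. This is an illustration under the assumption that every post/reply/like update carries the id of the forum it belongs to; the field names are hypothetical, not the driver's actual data model.

```python
def partition_by_forum(update_ops, num_streams):
    """Assign all updates of one forum to the same stream.

    Each stream is then executed sequentially by one driver thread
    (Synchronous Execution Mode); distinct streams run in parallel.
    """
    streams = [[] for _ in range(num_streams)]
    for op in update_ops:
        # Same forum id -> same stream, so ordering dependencies between
        # a post and its replies/likes are preserved by sequential
        # execution within the stream, without any GCT communication.
        streams[op["forum_id"] % num_streams].append(op)
    return streams
```

Within a stream, operations keep their generated order, so a reply is always executed after the post it replies to; only the (low-frequency) user-creation dependencies still go through SafeT and GCT.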
