OWL-Empowered SPARQL Query Optimization

by Irini Fundulaki, Giorgos Flouris / on 18 Feb 2015

The Linked Data paradigm has become the prominent enabler for sharing huge volumes of data using Semantic Web technologies, and has created novel challenges for non-relational data management systems, such as RDF and graph engines. Efficient data access through queries is perhaps the most important data management task, and is enabled through query optimization techniques, which amount to the discovery of optimal or close to optimal execution plans for a given query.

In this post, we propose a different approach to query optimization, which is meant to complement (rather than replace) the standard optimization methodologies for SPARQL queries. Our approach is based on the use of schema information, encoded using OWL constructs, which often accompany Linked Data.

OWL adopts the Open World Assumption and hence OWL axioms are perceived primarily to infer new knowledge. Nevertheless, ontology designers consider OWL as an expressive schema language used to express constraints for validating the datasets, hence following the Closed World Assumption when interpreting OWL ontologies. Such constraints include disjointness/equivalence of classes/properties, cardinality constraints, domain and range restrictions for properties and others.

This richness of information carried over by OWL axioms can be the basis for the development of schema-aware techniques that will allow significant improvements in the performance of existing RDF query engines when used in tandem with data statistics or even other heuristics based on patterns found in SPARQL queries. As a simple example, a cardinality constraint at the schema level can provide a hint on the proper join ordering, even if data statistics are missing or incomplete.

The aim of this post is to show that the richness of information carried over by OWL axioms under the Close World Assumption can be the basis for the development of schema-aware optimization techniques that will allow considerable improvement for query processing. To attain this objective, we discuss a small set of interesting cases of OWL axioms; a full list can be found here.

Schema-Based Optimization Techniques

Here we provide some examples of queries, which, when combined with specific schema constraints expressed in OWL, can help the optimizer in formulating the (near to) optimal query plans.

A simple first case is the case of constraint violation. Consider the query below, which returns all instances of class <A> which are fillers of a specific property <P>. If the underlying schema contains the information that the range of <P> is class <B>, and that class <B> is disjoint from class <A>, then this query should return the empty result, with no further evaluation (assuming that the constraints associated with the schema are satisfied by the data). An optimizer that takes into account schema information should return an empty result in constant time instead of trying to optimize or evaluate the large star join.

SELECT ?v 
WHERE { ?v rdf : type <A> .   
        ?u <P> ?v . ?u <P> ?v1 .   
        ?u <P1 > ?v2 . ?u <P2 > ?v3 .   
        ?u <P3 > ?v4 . ?u <P4 > ?v5}

Schema-aware optimizers could also prune the search space by eliminating results that are known a priori not to be in the answer set of a query. The query above is an extreme such example (where all potential results are pruned), but other cases are possible, such as the case of the query below, where all subclasses of class <A1> can immediately be identified as not being in the answer set.

SELECT ?c
WHERE { ?x rdf: type ?c . ?x <P> ?y . 
        FILTER NOT EXISTS \{ ?x rdf: type <A1 > }}

Another category of schema-empowered optimizations has to do with improved selectivity estimation. In this respect, knowledge about the cardinality (minimum cardinality, maximum cardinality, exact cardinality, functionality) of a property can be exploited to formulate better query plans, even if data statistics are incomplete, missing or erroneous.

Similarly, taking into account class hierarchies, or the definition of classes/properties via set theoretic constructs (union, intersection) at the schema level, can provide valuable information on the selectivity of certain triple patterns, thus facilitating the process of query optimization. Similar effects can be achieved using information about properties (functionality, transitivity, symmetry etc).

As an example of these patterns, consider the query below, where class <C> is defined as the intersection of classes <C1>, <C2>. Thus, the triple pattern (?x rdf:type <C>) is more selective than (?y rdf:type <C1>) and (?z rdf:type <C2>) and this should be immediately recognizable by the optimizer, without having to resort to cost estimations. This example shows also how unnecessary triple patterns can be pruned from a query to reduce the number of necessary joins. Figure 1 illustrates the query plan obtained when the OWL intersectionOf construct is used.

SELECT ?x 
WHERE { ?x rdf: type <C> . ?x <P1 > ?y . 
        ?y rdf : type <C1 > . ?y <P2 > ?z . ?z rdf : type <C2 > }

image

Schema information can also be used by the query optimizer to rewrite SPARQL queries to equivalent ones that are found in a form for which already known optimization techniques are easily applicable. For example, the query below could easily be transformed into a classical star-join query if we know (from the schema) that property P4 is a symmetric property.

SELECT ?y ?y1 ?y2 ?y3 
WHERE { ?x <P1 > ?y . ?x <P2 > ?y1 . 
        ?x <P3 > ?y2 . ?y3 <P4 > ?x }

Conclusion

In this post we argued that OWL-empowered optimization techniques can be beneficial for SPARQL query optimization when used in tandem with standard heuristics based on statistics. We provided some examples which showed the power of such optimizations in various cases, namely:

  • Cases where the search space can be pruned due to the schema and the associated constraints; an extreme special sub-case is the identification of queries that violate schema constraints and thus produce no results.
  • Cases where the schema can help in the estimation of triple pattern selectivity, even if statistics are incomplete or missing.
  • Cases where the schema can identify redundant triple patterns that do not affect the result and can be safely eliminated from the query.
  • Cases where the schema can be used for rewriting a query in an equivalent form that would facilitate optimization using well-known optimization techniques.

This list is by no means complete, as further cases can be identified by optimizers. Our aim in this post was not to provide a complete listing, but to demonstrate the potential of the idea in various directions.