Wednesday, August 04, 2004

Rdf, indices, hypergraphs, and iTql

An rdf database is a set of declarative statements: [subject predicate object]. Now the traditional way of representing this is as a traditional directed-graph with 'object' nodes with 'predicate' edges linking them. This is a useful representation which successfully visualises most rdf-graphs. Unfortunately for visualisers, for rdf to be useful you need to be able to attach semantics to predicates, and the logical way to do this is to make statements. A quick example:

Fred parentOf John
John parentOf Lucy
John parentOf Jane
This becomes much more useful if we can utilise statements such as
parentOf   subsetOf    relativeOf
relativeOf hasProperty transitive
relativeOf hasProperty symmetric
which allows us to match against match(Jane relativeOf $x) and get back { x=Fred, John, Lucy, Jane }

This is all well and good until you try and visualise the union of these two graphs (often called base-facts, and schema). The first problem you face is multiple edges between the verticies - forcing you to move to a 'directed-graph with named edges'; not too hard to visualise, but it isn't looking too much like a traditional graph anymore. The real kicker comes when you try and visualise the relationship between parentOf and relativeOf, now you have an edge between two edges -- and we're not in Kansas anymore Toto.

So what do we have? It is certainly 'graph-like', but it long ago ceased to be a traditional graph. More importantly, we haven't started considering 'statings', 'named-graphs', or 'reification'. The real problem is that a predicate is a valid vertex in the graph; an equal partner (conceptually) with its more numerous subject and object bretheren. Fortunately we're not the first to consider this problem. First consider the definition of a graph (preempting my summary of Trudeau Ch 2).

A graph is an object consisting of two sets called its vertex set and its edge set. The vertex set is a finite nonempty set. The edge set is a set of two-element subsets of the vertex set.
Looking at the definition, it becomes quite obvious that what we need is for the edge set to contain three-element subsets of the vertex set. This conveniently has a name: a 3-hypergraph. More importantly it can be easilly expanded to support named-graphs by using a 4-hypergraph, and as Paul Gearon noted, this has immediate relevance to the indicies kowari uses. It also has immediate relevance to crystalising iTql's semantics, allowing us to integrate the trans(), walk(), and the new ???() graph constraints --- we're still arguing over a name for what I like to call exclude(); and the original Constraint constraint (include()???). Each of these then returns an element of the powerset of VxVxV (where V is the vertex set). The elements of these resolved constraints are then statements, and as such amenable to predicate logic; and iTql provides conjunction and disjunction operations (and, or) and returns a sum-of-products solution to the resulting compound predicate.

So iTql provides four levels of operations:

Graph Operations
union, intersection, and disjoint-union of 3-hypergraphs in the FROM clause. Note the result of this is actually a 4-hypergraph.
Graph Constraints
inclusive, transitive, reachability, and soon exclusive constraints against a specific 4-hypergraph (by default the graph returned from the FROM clause). Note these are parametised by variables, and the result is a set of variable bindings that substitute to statisfy the constraint. These form the primitive constraints in an iTql WHERE clause.
Predicate Logic Expressions
conjunction and disjunction of variable binding sets, and type specific boolean functions (ie. comparison operators >, <, ==, etc) forming the rest of the WHERE clause. The result of this is a product-of-sums we represent as an implementation of the 'Tuples' interface. It is probably worth noting two things. First, a Tuples fails to be a relation only because the individual variable binding sets have differing cardinalities; we fake this by padding with a unique UNBOUND value which carries DONT-CARE semantics. Second, this can only happen when the expression includes a non-union compatable disjunction.
Projection, Aggregation, and Subqueries
ie. the SELECT clause: a list of functions to apply to the padded-product-of-sums returned by the WHERE clause; the results of which are zip'ed into the conventional tabular format most application developers find most useful.

No comments: