A LINQ provider for RDF files - part 2

For simple RDF queries such as

IQueryable<string> q = from x in rdf
                       from y in rdf
                       where rdf.A(germany, hasAdminDiv, x)
                          && rdf.A(x, isOfType, germanState)
                          && rdf.A(x, hasName, y)
                       select y.Val + " [" + x.Val + "]";

 

which we are going to support here, there is a “normal form” given by:

- a set of variables, which denote resources or values in an RDF document – in the example above this is {x,y}.

- a set of constraint triples (subj, pred, obj), where each of subj, pred and obj is either a variable or a constant. This is the query condition – in the above example it is
{ (germany, hasAdminDiv, x), (x, isOfType, germanState), (x, hasName, y) }

- a “projection function” using these variables which denotes the value which we associate with each “row” – in the above example this is
(x,y) => y.Val + " [" + x.Val + "]"

 

To execute such a query means finding all possible assignments of resources / values to the variables such that all resulting triples are in the axioms of the RDF file, and then applying the projection function to get a set of objects of a certain type (the return type of the projection function – in the above example this is string).
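
To make this definition concrete, here is a minimal sketch of a normal form and a brute-force way to execute it. The types Term, Constraint and NaiveSolver are hypothetical names chosen for this illustration and are not the ones used in the attached solution; values are plain strings, axioms are string arrays, and every variable is assumed to occur in at least one constraint.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only; not taken from the attached solution.
public sealed class Term
{
    public readonly string Text;       // variable name, or the constant itself
    public readonly bool IsVariable;
    private Term(string text, bool isVariable) { Text = text; IsVariable = isVariable; }
    public static Term Var(string name) { return new Term(name, true); }
    public static Term Const(string value) { return new Term(value, false); }
}

public sealed class Constraint
{
    public readonly Term Subj, Pred, Obj;
    public Constraint(Term subj, Term pred, Term obj) { Subj = subj; Pred = pred; Obj = obj; }
}

public static class NaiveSolver
{
    // Enumerates every binding of variables under which all constraints are axioms.
    // Each axiom is a string array of the form { subject, predicate, object }.
    public static IEnumerable<Dictionary<string, string>> Solve(
        IList<Constraint> constraints, IList<string[]> axioms)
    {
        return Extend(new Dictionary<string, string>(), 0, constraints, axioms);
    }

    private static IEnumerable<Dictionary<string, string>> Extend(
        Dictionary<string, string> binding, int index,
        IList<Constraint> constraints, IList<string[]> axioms)
    {
        if (index == constraints.Count)
        {
            yield return binding;      // every constraint is satisfied
            yield break;
        }
        Constraint c = constraints[index];
        foreach (string[] axiom in axioms)
        {
            // Work on a copy so that a failed match does not leak partial bindings.
            var next = new Dictionary<string, string>(binding);
            if (Unify(c.Subj, axiom[0], next) && Unify(c.Pred, axiom[1], next)
                && Unify(c.Obj, axiom[2], next))
            {
                foreach (var solution in Extend(next, index + 1, constraints, axioms))
                    yield return solution;
            }
        }
    }

    // Matches one term against a concrete value, recording new variable bindings.
    private static bool Unify(Term term, string value, Dictionary<string, string> binding)
    {
        if (!term.IsVariable) return term.Text == value;
        string bound;
        if (binding.TryGetValue(term.Text, out bound)) return bound == value;
        binding[term.Text] = value;
        return true;
    }
}

public static class NormalFormExample
{
    public static List<string> Run()
    {
        // Made-up axioms, loosely modelled on the Germany/France example.
        var axioms = new List<string[]>
        {
            new[] { "germany", "hasAdminDiv", "bavaria" },
            new[] { "germany", "hasAdminDiv", "hesse" },
            new[] { "france", "hasAdminDiv", "alsace" },
            new[] { "bavaria", "isOfType", "germanState" },
            new[] { "hesse", "isOfType", "germanState" },
            new[] { "alsace", "isOfType", "frenchRegion" },
            new[] { "bavaria", "hasName", "Bayern" },
            new[] { "hesse", "hasName", "Hessen" },
            new[] { "alsace", "hasName", "Alsace" }
        };
        Term x = Term.Var("x"), y = Term.Var("y");
        var constraints = new List<Constraint>
        {
            new Constraint(Term.Const("germany"), Term.Const("hasAdminDiv"), x),
            new Constraint(x, Term.Const("isOfType"), Term.Const("germanState")),
            new Constraint(x, Term.Const("hasName"), y)
        };
        Func<string, string, string> projection = (xv, yv) => yv + " [" + xv + "]";
        return NaiveSolver.Solve(constraints, axioms)
                          .Select(b => projection(b["x"], b["y"]))
                          .ToList();
    }
}
```

NormalFormExample.Run() returns one string per solution, here one for each of the two made-up German states. A real solver would use indexes instead of scanning all axioms for every constraint; the sketch only illustrates what "executing" a normal form means.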

 

The compiler will treat the above query expression as syntactic sugar for an expression like:

rdf.SelectMany(x => rdf.Where(y => Cond(x,y))
                       .Select(y => f(x,y)))

where Cond(x,y) is the condition involving rdf.A and f(x,y) is the function that assigns a string to each pair (x,y) of values in the RDF document.

 

The same query can be written in different forms. For example, replacing an expression

  rdf.Where(y => Cond1(x,y) && Cond2(x,y))

by

  rdf.Where(y => Cond1(x,y)).Where(z => Cond2(x,z))

should lead to the same normal form.
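
For ordinary in-memory sequences the two forms already agree, and this is the behaviour the normal form has to preserve. The following small check uses LINQ to Objects on integers rather than the Rdf type; cond1 and cond2 are arbitrary stand-ins for Cond1 and Cond2.

```csharp
using System;
using System.Linq;

public static class WhereChaining
{
    // Returns true if the combined and the chained form select the same elements.
    public static bool SameResult()
    {
        int[] source = Enumerable.Range(1, 12).ToArray();
        Func<int, bool> cond1 = n => n % 2 == 0;   // stand-in for Cond1
        Func<int, bool> cond2 = n => n % 3 == 0;   // stand-in for Cond2

        var combined = source.Where(y => cond1(y) && cond2(y));
        var chained = source.Where(y => cond1(y)).Where(z => cond2(z));

        // Note that y and z are different names for the same running element.
        return combined.SequenceEqual(chained);
    }
}
```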

 

So how do we get LINQ to translate these expressions to the above normal form?

 

To get LINQ started, our Rdf type has to implement the IQueryable<T> interface, as System.Data.DLinq.Table<T> does. When we query a database table without conditions, we get the set of all rows in the table. The analogous notion for an RDF file (or several RDF files, or any set of RDF triples) is the set of all “Values” in the RDF document, so we implement the interface IQueryable<Value> on Rdf.

 

“Value” is the common base type of Literal (meaning a string occurring in an object position in an axiom) and Resource (given by a URI occurring in any position in any axiom).
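
As a rough sketch, this small type hierarchy could look as follows. Only the names Value, Literal, Resource and the member Val appear in this post; the constructors, and the choice that Val holds the URI for a Resource, are assumptions made for the illustration.

```csharp
using System;

// Common base type of everything a query variable can stand for.
public abstract class Value
{
    public string Val { get; private set; }
    protected Value(string val) { Val = val; }
    public override string ToString() { return Val; }
}

// A string occurring in an object position of an axiom.
public sealed class Literal : Value
{
    public Literal(string text) : base(text) { }
}

// Identified by a URI; may occur in any position of an axiom.
// Assumption: Val holds the URI itself.
public sealed class Resource : Value
{
    public Resource(string uri) : base(uri) { }
}
```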

Since we usually do not want to retrieve all values occurring in a document, it does not matter much what exactly we get when we foreach over a document (e.g. all values, or only the resources). What matters is the IQueryable part, since it means that the query operators Where, Select and SelectMany are now defined for Rdf.

 

The basic observation is that we can now give the normal form of the query corresponding to a bare Rdf object (variables: {x}, constraints: {}, projection: x => x), and we can recursively determine the normal form of any query constructed from such objects with the operators Where, Select and SelectMany.
There is some fine print:

1) Variables and variable names:
In rdf.Where(y => Cond1(x,y)).Where(z => Cond2(x,z)) the names y and z correspond to the same variable (which runs over the rdf at the beginning of this expression). We have to be careful to distinguish between variables (that the solver will assign to values) and named references to these variables (like “y” and “z” above).

2) Variables can be defined outside of a (sub)expression:
In rdf.Where(y => Cond1(x,y)) the variable x is defined in an enclosing scope. When we translate a (sub)expression, we always have to give the list of variables in the enclosing scope as a parameter.

3) Some restrictions apply:
- We only deal with Where, Select and SelectMany when they are applied to an Rdf query with the identity projection function, i.e. one whose output is given by a variable and is a sequence of objects of type Value (not, for example, a sequence of strings).
- The conditions in a Where clause are built only from terms of the form Rdf.A(?,?,?), combined with && as in the example above; the predicate is always given as a constant, and at least one of the entries is a variable.

With these caveats, here is what this recursive algorithm does:

- Where:
Source.Where(v => Cond(v)):
Translate the query expression Source. Assume the output of Source is a variable. Make the name v point to the same variable, translate the condition and add the result to the list of constraints.
The output variable of the new query expression is the same as for Source.

- SelectMany:
Source.SelectMany(v => Seq(v)):
Translate the query expression Source. Assume the output of Source is given by a variable. Make the name v point to the same variable, then translate Seq with v in scope. The variables and constraints of the result are those of Source and Seq taken together. The projection function of the result is the projection function of Seq.

- Select:
Source.Select(v=>f(v)):
Translate the query expression Source. Assume the output of Source is given by a variable. Make the name v point to the same variable. Determine all parameters occurring in f, build a lambda expression (v1,v2,…,vn) => f(v1,v2,…,vn) and compile it. This is the projection function of the result. The variables and constraints of the result are the same as for Source.
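
The Where and SelectMany cases can be sketched on a toy query tree. This is a simplified model, not the code in the attached solution: the node types below are hypothetical stand-ins for LINQ expression trees, conditions are plain string triples, a term counts as a variable reference exactly when its name is in scope, and Select (the compiled projection) is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy query tree; all type names are hypothetical stand-ins for expression trees.
public abstract class QueryNode { }

public sealed class SourceNode : QueryNode { }           // plays the role of "rdf"

public sealed class WhereNode : QueryNode                // Input.Where(Name => Conditions)
{
    public readonly QueryNode Input;
    public readonly string Name;
    public readonly string[][] Conditions;               // each entry: { subj, pred, obj }
    public WhereNode(QueryNode input, string name, params string[][] conditions)
    { Input = input; Name = name; Conditions = conditions; }
}

public sealed class SelectManyNode : QueryNode           // Input.SelectMany(Name => Body)
{
    public readonly QueryNode Input;
    public readonly string Name;
    public readonly QueryNode Body;
    public SelectManyNode(QueryNode input, string name, QueryNode body)
    { Input = input; Name = name; Body = body; }
}

public sealed class NormalForm
{
    public int VariableCount;                            // variables are numbered from 0
    public readonly List<string[]> Constraints = new List<string[]>();  // "?n" = variable n
    public int Output;                                   // the output variable

    public string Describe()
    {
        return string.Join("; ", Constraints.Select(c => string.Join(" ", c)));
    }
}

public static class Translator
{
    public static NormalForm Translate(QueryNode query)
    {
        var result = new NormalForm();
        result.Output = Visit(query, new Dictionary<string, int>(), result);
        return result;
    }

    // scope maps the names visible at this point to variables of the enclosing
    // expressions; the return value is the output variable of the node.
    private static int Visit(QueryNode node, Dictionary<string, int> scope, NormalForm result)
    {
        if (node is SourceNode)
            return result.VariableCount++;               // each source adds a fresh variable

        var where = node as WhereNode;
        if (where != null)
        {
            int v = Visit(where.Input, scope, result);
            var inner = new Dictionary<string, int>(scope);
            inner[where.Name] = v;                       // the name points to the existing variable
            foreach (string[] triple in where.Conditions)
                result.Constraints.Add(triple.Select(t => Resolve(t, inner)).ToArray());
            return v;                                    // same output variable as the input
        }

        var many = node as SelectManyNode;
        if (many != null)
        {
            int v = Visit(many.Input, scope, result);
            var inner = new Dictionary<string, int>(scope);
            inner[many.Name] = v;
            return Visit(many.Body, inner, result);      // output is that of the body
        }

        throw new NotSupportedException(node.GetType().Name);
    }

    // A term naming something in scope becomes a variable; anything else is a constant.
    private static string Resolve(string term, Dictionary<string, int> scope)
    {
        int id;
        return scope.TryGetValue(term, out id) ? "?" + id : term;
    }
}
```

In this model, translating new WhereNode(new SourceNode(), "y", c1, c2) and new WhereNode(new WhereNode(new SourceNode(), "y", c1), "z", c2') (where c2' is c2 with y renamed to z) yields the same constraints, because y and z are resolved to the same variable number.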

 

I attach a VS2005 solution which implements this algorithm. It assumes the May LINQ CTP is installed.

It contains four projects:

- LinqToRdf is the main project which implements this algorithm
It uses an ITripleProvider object, which enumerates triples, and an ISolver object, which implements the solution algorithm: it takes “local information” about the possible ways to complete a triple when the predicate and possibly one of subject and object are given, and computes all solutions of a given query (given as a set of query triples).

- RdfXmlReader is an implementation of ITripleProvider which reads an RdfXml file. It uses Drive (see the last blog entry); you have to modify the reference to Drive.dll in this project to point to your copy of Drive.dll.

- SimpleSolver implements a simple algorithm to solve an Rdf query in the above normal form.

- Demo uses these assemblies to read in the RDF files containing information about Germany and France and list all “administrative divisions” of Germany and France.

As always, this sample code is the product of Weekend Evening Rapid Prototyping. It is provided as-is and does not come with any warranty.
You can copy, modify, and use the code for commercial and non-commercial purposes.

To build the RdfXmlReader project, you need to download Drive.dll from https://www.driverdf.org/; see there for legal restrictions which may apply to this DLL.

RdfReader.zip