Thursday, November 29, 2007

How to start thinking about RDF.

While trying to understand RDF, its capabilities, and its limitations, I have done a lot of reading of earlier work on semi-structured data management. It occurs to me that I have never blogged a paragraph from a 1995 paper on heterogeneous information sources that really crystallised for me the crucial difference between RDF and previous attempts at semi-structured data (especially XML and friends) - better late than never, I suppose...

We need not define in advance the structure of an object ... no notion of a fixed schema or object class. A label can play two roles: identifying an object (component) and identifying the meaning of an object (component). If an information source exports objects with a particular label, then we assume that the source can answer the question "What does this label mean?". It is particularly important to note that labels are relative to the source that exports them. That is, we do not expect labels to be drawn from an ontology shared by all information sources.

Papakonstantinou, Y., Garcia-Molina, H., and Widom, J. "Object Exchange Across Heterogeneous Information Sources", Proceedings of the 11th International Conference on Data Engineering, IEEE Computer Society, pp. 251-260, 1995.

The actual framework presented in the paper is focused on intra-organisational integration, and so doesn't have the property of anarchic scalability required if it is to be applied to the internet - but it does clearly express the need for source-defined semantics if semi-structured data is to be tractable.

Wednesday, October 17, 2007

Implementing a Resource Manager for the Java Transaction API.

One would hope that when an organisation like Sun specifies a standard as important as a transaction API, they would take the time to do a reasonable job of it -- unfortunately that hope would be misplaced. The JTA specification is freely available from Sun, but the document is extremely vague and requires the reader to infer the underlying transaction state machine - with the unsurprising result that the corner cases are left completely unspecified. Having waded through this morass myself, I include here my advice on what you need to do to understand JTA, in the hope that anyone else coming to this standard won't have to waste as much time as I have understanding it.

Read the JTA spec last. JTA is little more than a thin wrapper around the Object Transaction Service 1.1 published by the OMG. This is one of the CORBA standards published in the late 90's and early 2000's - and like most of the early CORBA specs, it is well written, easy to read, and pretty complete. Unfortunately it too leaves the underlying state machine and some of the corner cases (especially those to do with recovery) underspecified. As a result I recommend printing it out and referring to it as an adjunct to the JTA spec.

Fortunately the CORBA OTS is itself an object-oriented wrapper around another spec: Distributed Transaction Processing: The XA Specification, published by X/Open (now The Open Group). There, in Chapter 2, you will find most of the definitions missing from the other two specs; and in Chapter 6 the state tables that provide definitive semantics for the various operations you will need to implement. You will also find a reference to another related X/Open specification - Distributed Transaction Processing: Reference Model - which contains the remaining definitions and assumptions missing from the JTA/OTS specs.

So if you do need to implement a JTA interface I strongly recommend you start with the X/Open reference model; then read the X/Open XA Spec; and only then read the JTA specification alongside the OTS spec for elaboration.
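
To make the state tables concrete, here is a minimal sketch of the shape of an XAResource implementation. Only the javax.transaction.xa interfaces are real; the State enum and the transition checks are my own simplified rendering of the Chapter 6 state tables - a real resource manager tracks state per transaction branch (per Xid), handles heuristic outcomes, and implements recovery.

import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

// Sketch only: the State enum and transitions are a simplified rendering of
// the X/Open XA state tables, kept global rather than per-Xid for brevity.
public class SketchXAResource implements XAResource {
  private enum State { IDLE, ACTIVE, SUSPENDED, ENDED, PREPARED }
  private State state = State.IDLE;
  private int timeout = 0;

  public void start(Xid xid, int flags) throws XAException {
    // TMNOFLAGS starts new work; TMJOIN/TMRESUME re-associate an existing branch.
    if (flags == TMNOFLAGS && state != State.IDLE)
      throw new XAException(XAException.XAER_PROTO);
    state = State.ACTIVE;
  }

  public void end(Xid xid, int flags) throws XAException {
    if (state != State.ACTIVE && state != State.SUSPENDED)
      throw new XAException(XAException.XAER_PROTO);
    state = (flags == TMSUSPEND) ? State.SUSPENDED : State.ENDED;
  }

  public int prepare(Xid xid) throws XAException {
    if (state != State.ENDED) throw new XAException(XAException.XAER_PROTO);
    state = State.PREPARED;
    return XA_OK;  // XA_RDONLY is also legal here, and completes the branch
  }

  public void commit(Xid xid, boolean onePhase) throws XAException {
    // One-phase commit is driven from ENDED; two-phase requires PREPARED.
    if (state != (onePhase ? State.ENDED : State.PREPARED))
      throw new XAException(XAException.XAER_PROTO);
    state = State.IDLE;  // a real implementation writes the changes here
  }

  public void rollback(Xid xid) throws XAException {
    state = State.IDLE;  // a real implementation discards the changes here
  }

  public Xid[] recover(int flag) { return new Xid[0]; }  // report in-doubt branches
  public void forget(Xid xid) {}
  public boolean isSameRM(XAResource other) { return this == other; }
  public int getTransactionTimeout() { return timeout; }
  public boolean setTransactionTimeout(int secs) { timeout = secs; return true; }
}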

Wednesday, September 19, 2007

Model-URI/URL Use-cases, Requirements, and a Proposal

Just posted this to mulgara-general - posting here to provide a readily accessible permanent reference. I would greatly appreciate any comments anyone may have - please also feel free to solicit comments from outside the mulgara community if there is interest.

The Use Cases and Requirements
The three key requirements of a model-URI proposal are:

1. Protocol/Scheme independence
2. Model/Server mobility
3. URI-standards compliance (i.e. no fragment)

Also desirable are

4. Unique-name
5. Namespaced, to allow a) potential resolution; b) predictable, human-readable URIs.

The context of the most complex use-case involves 4 models and 4 machines (and assumes a Distributed or Federated Resolver):

:modelA is on server1 on host1 and needs 
     to reference :modelB and :modelC
:modelB is on server2 on host2
:modelC is on server3 on host3
:modelD is on server4 on host4 run by an unrelated organisation

The application needs to perform the query:

select $id subquery(
  select $s $p $o 
  where $s $p $o in $locn and 
        $id <mulgara:locatedAt> $locn in <mulgara:modelURLResolver>)
from host1:modelA
where [ <:useModel> $id ] ;

This queries each model listed in :modelA, after converting each model's identifier into a URL via a posited resolution mechanism.

Now host2 fails, and we restore server2 on host3 to run alongside server3.

We would like to be able to have the query run unmodified.

What this means is that :modelB cannot encode host2 in its URI.

The URI does need to encode some sort of server-id, as servers are guaranteed to use the same model-names at least some of the time (consider that all system-models have the name "").

Also, because :modelD and :modelA-C are managed by unrelated organisations, we must somehow encode the organisation in the model's URI-stem, as the organisations may well decide to use the same server-id ("server1" or "database" anyone?).

Also consider that any encoding of the organisation must allow that organisation to maintain their own independent registry, or the proposal ceases to be scale-free (it's on this that the original UUID proposal foundered).

I have considered abandoning requirement 4 and just using URLs. However we ultimately require a canonical name for internal purposes (even if it isn't exposed externally), and so even using URLs we would have to pick a designated 'unique name' for each model - we can't escape that - so we might as well save ourselves the headache and make it unambiguous.

So a summary of my thinking on the use-cases/requirements for rdf model-names - we desire:

1. Unambiguously an identifier
2. Encodes organisation
3. Encodes server-id
4. Doesn't encode hostname
5. Potentially resolvable via a per-organisation registry

The Proposal

If we wish to be unambiguous then we should use our own URI-scheme. This has the added benefit that once we use our own scheme we have a lot more flexibility regarding how we structure the rest of the URI to meet our requirements.

I am proposing to use the scheme 'rdfdb' - as did the original UUID proposal.

I would prefer to avoid the use of opaque URIs; there is no reason why our URI can't be introspected if we structure it sanely - so the structure according to RFC 2396 will be 'rdfdb://authority/path'.

Logically the model-name itself makes a good path, so we arrive at 'rdfdb://authority/modelName'. This leaves the need to encode an organisation and a server-id in the authority, in a fashion that will potentially permit resolution via a registry.

Now, as the authority is not a hostname, RFC 2396 identifies us as a "Registry-based Naming Authority". As such, the characters permitted to us are [ - _ . ! ~ * ' ( ) $ , ; : @ & = + ] (excluding the []'s) - and the reserved characters are [ ? / ].

I therefore propose to structure the authority as 'server-id~organisation-id' (that is, the server-id and org-id separated by a tilde).

At the moment we don't support hierarchical server-ids, but I would like to leave us the option of doing so once we start supporting more aggressive distribution. We also need to consider that the server-id must remain a valid path-element for use in our existing model-URLs. So for now I would like to limit server-id to what we currently use, but ultimately I think we should consider some sort of delimited hierarchical form (probably dotted).

The organisation-id should be something that will eventually permit the identification of a registry. For now a dotted hierarchical form should suffice - although I will make sure the implementation leaves this as open as possible (the use of a tilde makes this possible).

It has also been suggested that to make it unambiguously clear we are *not* encoding a hostname as the organisation-id we should invert the traditional dns-style representation.

So putting all the pieces together: If I am running a mulgara server -

host:         pneuma.netymon.com
organisation: netymon.com
server-id:    rdfDatabase
model-name:   addressBook

The model URL for addressBook remains: rmi://pneuma.netymon.com/rdfDatabase#addressBook
or: soap://pneuma.netymon.com/rdfDatabase#addressBook ...etc...

and the model URI for the model is: rdfdb://rdfDatabase~com.netymon/addressBook
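
As a sanity check of the proposed syntax, here is a small illustrative sketch (the class name and output format are mine) showing that a standard RFC 2396 parser pulls the pieces back apart cleanly:

import java.net.URI;

// Illustrative sketch: decompose a proposed rdfdb model-URI into its
// server-id, organisation-id, and model-name components.
public class RdfDbNameDemo {
  public static void main(String[] args) {
    URI uri = URI.create("rdfdb://rdfDatabase~com.netymon/addressBook");

    // The tilde is an unreserved character in a registry-based naming
    // authority, so it survives parsing intact.
    String authority = uri.getAuthority();                 // rdfDatabase~com.netymon
    int tilde = authority.indexOf('~');
    String serverId = authority.substring(0, tilde);       // rdfDatabase
    String organisation = authority.substring(tilde + 1);  // com.netymon
    String modelName = uri.getPath().substring(1);         // addressBook

    System.out.println(serverId + " | " + organisation + " | " + modelName);
  }
}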

Monday, September 17, 2007

Operational vs. Denotational Semantics

Spent a little time this afternoon discussing several topics with LB and SR. One topic we touched on was our continuing efforts to understand the distinction between denotational and operational semantics - I continue to be surprised at just how hard it's proving to nail down the precise distinction.

Looking at the various scrawls on my whiteboard that are the only physical remnants of what was a fascinating discussion, I gave some more thought to the distinction, and I believe it can be described thus:

Operational  : M(P) |= σ -> σ'
Denotational : M(P) |= κ -> P'

i.e. in operational semantics the meaning of a program is a transition function on the states of a virtual machine;
in denotational semantics the meaning of a program is a mapping from an initial basis to a new (simplified) program.

Now this is confused by most operational semantics being "small-step", where the meaning is defined via structural recursion on the abstract grammar:

M(ρ) |- σ -> M(ρ') σ'   { the meaning of an AST production is a transition function from an initial VM state to the meaning of a (simplified) production applied to a new VM state }

This ends up looking very similar to denotational semantics, as denotational semantics are also normally defined via structural recursion.

But even given the similarity, the core distinction (as I understand it) remains: denotational semantics are defined in terms of a reduction-semantics from a program+basis to a simplified program (normally a program in a different, mathematically tractable language such as the lambda-calculus), whereas operational semantics are defined in terms of a transition-semantics from a program+initial-state to a new state (normally a state extended with function-values defined in a mathematically tractable language such as the lambda-calculus).
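
To make the small-step flavour concrete, here is a toy sketch of a transition function defined by structural recursion over an abstract grammar (every name in it is invented for illustration; real treatments define a relation, not a Java method):

public class SmallStep {
  static abstract class Term {}
  static class Lit extends Term { final int n; Lit(int n) { this.n = n; } }
  static class Add extends Term {
    final Term l, r;
    Add(Term l, Term r) { this.l = l; this.r = r; }
  }

  // One transition of the abstract machine: reduce the leftmost reducible
  // sub-term, found by structural recursion on the grammar.
  static Term step(Term t) {
    if (t instanceof Add) {
      Add a = (Add) t;
      if (a.l instanceof Lit && a.r instanceof Lit)
        return new Lit(((Lit) a.l).n + ((Lit) a.r).n);
      if (a.l instanceof Lit) return new Add(a.l, step(a.r));
      return new Add(step(a.l), a.r);
    }
    throw new IllegalStateException("already in normal form");
  }

  public static void main(String[] args) {
    Term t = new Add(new Add(new Lit(1), new Lit(2)), new Lit(3));
    while (!(t instanceof Lit)) t = step(t);  // iterate the transition relation
    System.out.println(((Lit) t).n);          // prints 6
  }
}

Note the state here is just the term itself; a denotational treatment would instead map each Term, once and compositionally, to a mathematical object (for this toy language, an integer).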

Now of course interpreter semantics is a whole different kettle of fish, and of course leaves you facing the 'turtle problem' - you can try for 'turtles all the way down' if you like, but once you hit the metal you've pretty much lost the point of having a semantics in the first place. I must admit admiring the way ECMAScript handled it - an interpreter semantics in SML, which has an operational semantics of its own, thereby avoiding the problem.

Thursday, August 30, 2007

RDF, Document Repositories

I have recently had several discussions with people working on/with various document repository projects. One thing I hear is the increasing understanding of the importance of semantics in repository metadata management. I'm naturally pleased to see this because storage, query, and management of semantics is precisely where projects like mulgara come in.

Now digital repositories are a lot more than just a semantic store; there are also issues associated with the actual document storage and retrieval, and with QA, workflow, and the various metadata-extraction tasks. However these days most repository projects include some sort of semantic store to manage metadata, and the question is reasonably asked - why not just use that?

The reason I recommend considering mulgara to augment a document repository is the additional flexibility gained by doing so. When repository developers approach the metadata problem they have a natural tendency to adopt an 'instance-focused' view of metadata. This is where the focus is on the 'document': each document is of a given 'document type', and each 'document type' implies corresponding 'attributes'. In contrast, RDF is 'property-focused' --- the focus is on defining the 'meaning' of 'properties', and a 'document' having a given property 'classifies' the document as belonging to a given 'class of documents'.

While the two are duals, they do influence design. If you take an instance-focused approach you will find yourself heading towards document-type hierarchies and defined attribute lists. If you take a property-focused approach you will find yourself defining lexicons and ontologies. The former tends towards relational approaches, the latter towards semantic approaches such as RDF. The reason I believe the semantic approach to be superior is its flexibility. Hierarchies are predicated on centralised control. Even if you can maintain central control over the document-type definitions and their attributes, the very act of standardisation in this manner leaves your project with the unfortunate choice between 'scale' and 'responsiveness to change'.

RDF and its related standards allow the decentralisation of attribute/property definition, while providing the tools to manage the resulting 'mess'. With the combination of namespaces, RDFS, and assuming the provision of schema repositories, it becomes possible to allow global use and reuse of locally defined 'models'. This is especially relevant when you consider that 'local' should be considered in both temporal and spatial senses - keeping the system responsive both to the differing needs of independent organisations, and the changing requirements of a single entity.

The result is the need for two additional boxes in your architecture diagram. The first is a vocabulary server that allows users to define their own vocabulary extensions, and makes those definitions available to applications - and, critically, capable of management and eventual rationalisation. The second is a metadata server that can store the resulting data and permit both ad-hoc querying by applications and the inferencing required by vocabulary rationalisation. One of the reasons mulgara exists is to provide the storage and query engine required by these components - and so I do enjoy having the chance to talk to people about it.

Wednesday, February 14, 2007

The Effect of File Sharing on Record Sales: An Empirical Analysis

Felix Oberholzer-Gee Harvard University and Koleman Strumpf University of Kansas

Abstract

For industries ranging from software to pharmaceuticals and entertainment, there is an intense debate about the appropriate level of protection for intellectual property. The Internet provides a natural crucible to assess the implications of reduced protection because it drastically lowers the cost of copying information. In this paper, we analyze whether file sharing has reduced the legal sales of music. While this question is receiving considerable attention in academia, industry, and Congress, we are the first to study the phenomenon employing data on actual downloads of music files. We match an extensive sample of downloads to U.S. sales data for a large number of albums. To establish causality, we instrument for downloads using data on international school holidays. Downloads have an effect on sales that is statistically indistinguishable from zero. Our estimates are inconsistent with claims that file sharing is the primary reason for the decline in music sales during our study period.

In the Journal of Political Economy.

Available in Volume 115, Number 1, February 2007

From Ars Technica via Lawrence Lessig

Tuesday, February 13, 2007

Java generics and the covariance and contravariance of arguments

Well, given that we require 1.5 now for other reasons, and that 1.5 complains if you leave generic classes unparameterised, I have finally bitten the bullet and started using generics. Unfortunately I was promptly bitten by what I suspect is going to be a very common mistake - in this case, failing to properly consider the type equivalence of parametrised method calls.

Consider the following code:

public interface TestInterface { }

public class TestClass implements TestInterface { }

import java.util.ArrayList;
import java.util.List;

public class Test {
 private List<TestClass> list;

 public TestInterface test() {
   list = new ArrayList<TestClass>();
   list.add(new TestClass());

   return covariant(list);
 }

 public TestInterface covariant(List<TestInterface> ilist) {
   return ilist.remove(0);
 }
}
Now there is absolutely no reason why this should not work. It is trivially inferable that the above code treats ilist as covariant in the list-type - and that therefore this code is statically correct.

Of course Java's typing has never been particularly smart. List<T1>.add(T1) is contra-variant in T1, and T2 List<T2>.get(int) is co-variant in T2; so the Java compiler is correct to infer that in the general case List<T1> and List<T2> are substitutable iff T1 == T2.

If we can't declare a generic parameter to be covariant in its type parameter we have a serious problem - it means that any non-trivial algorithm involving collections is going to run afoul of this. You might consider trying to cast your way around it:

 public TestInterface test() {
   list = new ArrayList<TestClass>();
   list.add(new TestClass());

   return covariant((List<TestInterface>)list);
 }
but not surprisingly that didn't work:
Test.java:11: inconvertible types
found   : java.util.List<TestClass>
required: java.util.List<TestInterface>
 return covariant((List<TestInterface>)list);
                                       ^
1 error
If you continue to hack at it you might try a double cast via a non-generic List.
 public TestInterface test() {
   list = new ArrayList<TestClass>();
   list.add(new TestClass());

   return covariant((List<TestInterface>)((List)list));
 }
This works but leaves us with the unchecked/unsafe operation warning:
Note: Test.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
Now this is a perfectly reasonable warning - it is unchecked; it is unsafe; and more importantly it does violate encapsulation. The problem here is that the caller should not be defining the type invariants of the callee - that's the job of the method signature!

The correct solution is to allow us to declare covariant() to be covariant in its argument; and fortunately Java does support this.

To declare an argument to be covariant in its type parameter you can use the extends keyword:

 public TestInterface covariant(List<? extends TestInterface> ilist) {
   return ilist.remove(0);
 }
To declare an argument to be contravariant in its type parameter you use the super keyword:
 public void contravariant(List<? super TestClass> clist, TestClass c) {
   clist.add(c);
 }
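
The two combine naturally. As a usage sketch (my own example, not from the tutorial), a copy method reads from a covariant source and writes into a contravariant destination - the same shape as the signature of Collections.copy():

import java.util.ArrayList;
import java.util.List;

public class CopyExample {
  // Reading from src needs covariance (? extends); writing into dst needs
  // contravariance (? super).
  public static <T> void copyInto(List<? super T> dst, List<? extends T> src) {
    for (T t : src) {
      dst.add(t);
    }
  }

  public static void main(String[] args) {
    List<TestClass> src = new ArrayList<TestClass>();
    src.add(new TestClass());
    List<TestInterface> dst = new ArrayList<TestInterface>();
    copyInto(dst, src);  // T is inferred as TestClass
  }
}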
Without these two facilities generics would be badly broken, so I am glad Sun had the presence of mind to include them. By the way, if you are using Java 1.5 I strongly recommend you read the Java Generics Tutorial.

As an aside, it is worth noting that as Java includes a top type, Object, List<? extends Object> is a common covariant type - sufficiently common that Sun has included syntactic sugar for it: List<?>. Personally I'm not sure this was such a good idea; List<? extends Object> would work anyway, and I think I would prefer to have kept the covariance explicit.

Update: Corrected a capitalisation error in the initial Java example.

Tuesday, February 06, 2007

Five things to consider when writing a Mulgara Resolver

Ended up writing a longer response than I had planned to a question about writing a resolver in Mulgara today. I'm putting it here to keep a handle on it, as it does cover the basic structure of the resolve() method in reasonable detail.

First it is important to realise that resolvers *don't* return triples - they return Resolutions. These are Tuples that provide bindings for the variables in the Constraint passed to resolve(). So in the case of <http://www.example.com/path/subpath> $predicate $object, the resulting Resolution should have two variables ($predicate and $object). In the case of <../subpath> <http://www.schema.com#parent> $subpath, it will have one ($subpath).

You should also be aware that a Resolution can be unevaluated! It is not uncommon for bindings required to evaluate the constraint to come from other parts of the query. Consider the following where clause:

$url $p $o in <rmi://localhost/server1#sample> 
and 
<myfile> <hasurl> $url
In this case your resolver will be asked to resolve ($url $p $o) and to return a Resolution that will later be passed the $url in the prefix argument to beforeFirst(). Evaluation would then occur either in beforeFirst() or in the calls to next() - we prefer it to happen in beforeFirst() if the memory requirement isn't unreasonable, as our algorithmic reasoning assumes a comparatively cheap next().

If you require that a particular variable be bound prior to final evaluation then you need to provide a MandatoryBindingAnnotation - this indicates to the join logic that it must ensure a specific binding is satisfied by other constraints in the query before you are evaluated (in this case $url).

It is also worth noting that, due to the support of intervals and the resulting interaction with query transformations, the XSDResolver is quite complicated as resolvers go. Setting that aside, a call to resolve() consists of:

  1. Obtain the model (constraint.getModel()).
  2. Do any preparatory work, especially any work that might be able to prove the result empty (or a singleton).
  3. If you can't prove the result empty (or singleton), defer further evaluation to the returned Resolution.
Then inside the Resolution you need to consider how you implement the following key methods and annotations:

MandatoryBindingAnnotation
are there any variables that *must* be bound for the deferred evaluation to terminate?

DefinablePrefixAnnotation
can you cheaply reorder the variables in the result (log n or less)?

ReresolvableResolution
can you cheaply reresolve the constraint if additional information becomes available (again, log n or less)? [note: this will become an Annotation like the other two in the Mulgara 1.2 dev-cycle]

beforeFirst()
you can ignore the suffixTruncation arg, but you can't ignore the prefix - these *are* the values of the first N variables of the resolution. If all the variables are passed as a prefix your only decision is 1 row or 0 - but most of the time you will be passed fewer than this. At this point you have either performed the evaluation, or you have set up the evaluation and deferred the rest to be done incrementally on each call to next().

next()
does whatever is required to ensure that subsequent calls to getColumnValue() can succeed. There is only one Tuples class that defers evaluation beyond this point (the implementation of count()). Naturally we don't want to go to the effort of evaluating an entire subquery until the user actually goes to use it - so we defer evaluation of the count() until the call to getColumnValue().

getColumnValue()
normally this is a matter of returning values calculated in either beforeFirst() or next() - occasionally it amounts to performing the evaluation itself, but this is uncommon.

The whole point of the Resolution/Tuples/beforeFirst()/next() bother is to implement lazy evaluation in Java. We only scale to bignum levels when all query evaluation is done on a call-by-need basis.
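
To pull the above together, here is a skeletal sketch of that control flow. The interfaces below are simplified stand-ins I have invented for illustration - they are not the actual Mulgara Resolver/Resolution signatures - but the shape of the deferral is the point:

// Hypothetical sketch: simplified stand-ins for the real Mulgara types,
// showing deferred (lazy) evaluation via the returned Resolution.
interface Constraint { Object getModel(); }

interface Resolution {
  void beforeFirst(Object[] prefix);  // prefix binds the first N variables
  boolean next();                     // advance one row; should be cheap
  Object getColumnValue(int column);  // usually just returns a stored value
}

class SketchResolver {
  Resolution resolve(final Constraint constraint) {
    final Object model = constraint.getModel();  // 1. obtain the model

    // 2. preparatory work would go here: try to prove the result empty
    //    (or a singleton) cheaply.

    // 3. otherwise defer: the real work waits until the join logic has
    //    supplied the prefix bindings we depend on.
    return new Resolution() {
      private Object[][] rows;
      private int cursor = -1;

      public void beforeFirst(Object[] prefix) {
        rows = evaluate(constraint, model, prefix);  // bulk of the work here
        cursor = -1;
      }
      public boolean next() { return ++cursor < rows.length; }
      public Object getColumnValue(int column) { return rows[cursor][column]; }
    };
  }

  // Placeholder for the source-specific evaluation against the model.
  Object[][] evaluate(Constraint c, Object model, Object[] prefix) {
    return new Object[0][];
  }
}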

Thursday, January 25, 2007

Umm, would you mind if we borrowed your DPP?

from boingboing

Finally a public official has the courage to speak sense.

"London is not a battlefield. Those innocents who were murdered on July 7 2005 were not victims of war. And the men who killed them were not, as in their vanity they claimed on their ludicrous videos, 'soldiers'. They were deluded, narcissistic inadequates. They were criminals. They were fantasists. We need to be very clear about this. On the streets of London, there is no such thing as a 'war on terror', just as there can be no such thing as a 'war on drugs'. "The fight against terrorism on the streets of Britain is not a war. It is the prevention of crime, the enforcement of our laws and the winning of justice for those damaged by their infringement."
If this were the only sentiment I would applaud, but to see his respect for the rule of law gives me hope that our nations can come through this intact.
"We wouldn't get far in promoting a civilising culture of respect for rights amongst and between citizens if we set about undermining fair trials in the simple pursuit of greater numbers of inevitably less safe convictions. On the contrary, it is obvious that the process of winning convictions ought to be in keeping with a consensual rule of law and not detached from it. Otherwise we sacrifice fundamental values critical to the maintenance of the rule of law - upon which everything else depends."
There is no right, no principle, no aspect of our 'way of life' that isn't completely dependent on the rule of law. Without the rule of law every other guarantee, social contract, or bill of rights is moot - completely and absolutely unenforceable, and thereby irrelevant.

Note to any political types who might read this: this is a vote swinger. The courage to respect and maintain the rule of law in the face of fear is my definition of leadership, and it is what I am looking for above all when deciding my vote.