Wednesday, September 19, 2007

Model-URI/URL Use-cases and Requirements and Proposal

Just posted this to mulgara-general - posting here to provide readily accessible permanent reference. I would greatly appreciate any comments anyone may have - please also feel free to solicit comments from outside the mulgara community if there is interest.

The Use Cases and Requirements
The three key requirements of a model-URI proposal are:

1. Protocol/Scheme independence
2. Model/Server mobility
3. URI-standards compliance (i.e. no fragment)

Also desirable are

4. Unique-name
5. Namespaced to allow a) potential resolution; b) predictable, human-readable URIs.

The context of the most complex use-case involves 4 models and 4 machines (and assumes a Distributed or Federated Resolver)

:modelA is on server1 on host1 and needs 
     to reference :modelB and :modelC
:modelB is on server2 on host2
:modelC is on server3 on host3
:modelD is on server4 on host4 run by an unrelated organisation

The application needs to perform the query:

select $id subquery(
  select $s $p $o 
  where $s $p $o in $locn and 
        $id <mulgara:locatedAt> $locn in <mulgara:modelURLResolver>)
from host1:modelA
where [ <:useModel> $id ] ;

This queries each model listed in :modelA after converting its identifier into a URL via a posited resolution mechanism.

Now host2 fails, and we restore server2 on host3 to run alongside server3.

We would like to be able to have the query run unmodified.

What this means is that :modelB cannot encode host2 in its URI.
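To make the failover scenario concrete, here is a minimal sketch of the posited resolution mechanism. It is only an illustration: the dictionary-based registry, the function name `resolve`, and the bare identifiers are all assumptions, and the URLs simply follow the existing protocol://host/server#model pattern.

```python
# Hypothetical registry mapping a host-free model identifier
# to the model's current URL. Not an existing mulgara API.
registry = {
    "modelB": "rmi://host2/server2#modelB",
    "modelC": "rmi://host3/server3#modelC",
}


def resolve(identifier: str) -> str:
    """Return the current URL for a model identifier."""
    return registry[identifier]


# The identifiers stored in :modelA never mention a host.
used_models = ["modelB", "modelC"]

urls_before = [resolve(m) for m in used_models]

# host2 fails and server2 is restored on host3:
# only the registry entry changes, not the stored identifiers.
registry["modelB"] = "rmi://host3/server2#modelB"

urls_after = [resolve(m) for m in used_models]

print(urls_before)
print(urls_after)
```

Because the list of identifiers is untouched by the move, anything that refers to the models by identifier (such as the query above) would not need editing.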

The URI does need to encode some sort of server-id, as servers are guaranteed to use the same model-names at least some of the time (consider that all system-models have the name "").

Also, because :modelD and :modelA-C are managed by unrelated organisations, we must somehow encode the organisation in the model's URI-stem, as they may well decide to use the same server-id ("server1" or "database" anyone?).

Also consider that any encoding of the organisation must allow that organisation to maintain their own independent registry, or the proposal ceases to be scale-free (it's on this that the original UUID proposal foundered).

I have considered abandoning requirement 4, and just using URLs. However ultimately we require a canonical name for internal purposes (even if it isn't exposed externally), and so even using URLs we would have to pick a designated 'unique name' for the model - we can't escape that - so we might as well save ourselves the headache and make it unambiguous.

So a summary of my thinking on the use-cases/requirements for rdf model-names - we desire:

1. Unambiguously an identifier
2. Encodes organisation
3. Encodes server-id
4. Doesn't encode hostname
5. Potentially resolvable via a per-organisation registry

The Proposal

If we wish to be unambiguous then we should use our own URI-scheme. This has the added benefit that once we use our own scheme we have a lot more flexibility regarding how we structure the rest of the URI to meet our requirements.

I am proposing to use the scheme 'rdfdb' - as did the original UUID proposal.

I would prefer to avoid the use of opaque URIs; there is no reason why our URI can't be introspected if we structure it sanely - so the structure according to RFC 2396 will be 'rdfdb://authority/path'.

Logically the model-name itself makes a good path, so we arrive at 'rdfdb://authority/modelName'. That leaves the need to encode an organisation and a server-id in the authority in a fashion that will potentially permit resolution via a registry.

Now as the authority is not a hostname, RFC 2396 identifies us as a "Registry-based Naming Authority". As such, the characters we have permitted to us (besides alphanumerics and %-escapes) are [ - _ . ! ~ * ' ( ) $ , ; : @ & = + ] (excluding the []'s) - and the characters reserved are [ ? / ].
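As a rough check of that character set, the following sketch validates a candidate authority against the RFC 2396 reg_name rule (alphanumerics, the punctuation listed above, and %XX escapes). The function name `is_valid_authority` is made up for illustration, and this is not taken from any existing implementation.

```python
import re

# RFC 2396 reg_name: one or more of unreserved characters,
# %XX escapes, or any of  $ , ; : @ & = +
_REG_NAME = re.compile(r"(?:[A-Za-z0-9\-_.!~*'()$,;:@&=+]|%[0-9A-Fa-f]{2})+")


def is_valid_authority(authority: str) -> bool:
    """True if the whole string is a valid registry-based authority."""
    return _REG_NAME.fullmatch(authority) is not None


print(is_valid_authority("server1~org.example"))  # tilde is permitted
print(is_valid_authority("server1/org.example"))  # '/' ends the authority
print(is_valid_authority("server1?org.example"))  # '?' starts the query
```

The first call should be accepted and the other two rejected, since "/" and "?" cannot appear inside the authority.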

I therefore propose to structure the authority 'server-id~organisation-id' (that is the server-id and org-id separated by a tilde).

At the moment we don't support hierarchical server-ids; but I would like to leave us the option of doing so once we start supporting more aggressive distribution. We also need to consider that the server-id needs to remain a valid path-element for use in our existing model-URLs. So for now I would like to limit server-id to what we currently use, but ultimately I think we should consider some sort of delimited hierarchical form (probably dotted).

The organisation-id should be something that will eventually permit the identification of a registry. For now a dotted hierarchical form should suffice - although I will make sure the implementation leaves this as open as possible (the use of a tilde makes this possible).

It has also been suggested that to make it unambiguously clear we are *not* encoding a hostname as the organisation-id we should invert the traditional DNS-style representation (e.g. com.netymon rather than netymon.com).
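One benefit of a non-opaque URI is that it can be taken apart again. Below is a small sketch of how a model URI of the proposed shape could be split into its parts; the names `ModelName` and `parse_model_uri` are hypothetical, and "server1~org.example" is just a placeholder authority.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    server_id: str
    organisation_id: str  # kept in its inverted, dotted form
    model_name: str       # may be empty


def parse_model_uri(uri: str) -> ModelName:
    """Split 'rdfdb://server-id~organisation-id/modelName' into parts."""
    prefix = "rdfdb://"
    if not uri.startswith(prefix):
        raise ValueError("not an rdfdb URI: " + uri)
    # Everything up to the first '/' is the authority, the rest is the path.
    authority, slash, model_name = uri[len(prefix):].partition("/")
    # The first tilde separates the server-id from the organisation-id.
    server_id, tilde, organisation_id = authority.partition("~")
    if not (slash and tilde and server_id and organisation_id):
        raise ValueError("malformed rdfdb URI: " + uri)
    return ModelName(server_id, organisation_id, model_name)


print(parse_model_uri("rdfdb://server1~org.example/modelA"))
```

Splitting on the first tilde only is a design choice in this sketch: it leaves the organisation-id free to take whatever internal form is eventually chosen.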

So putting all the pieces together: If I am running a mulgara server -

host:         pneuma.netymon.com
organisation: netymon.com
server-id:    rdfDatabase
model-name:   addressBook

The model URL for addressBook remains: rmi://pneuma.netymon.com/rdfDatabase#addressBook
or: soap://pneuma.netymon.com/rdfDatabase#addressBook ...etc...

and the model URI for the model is: rdfdb://rdfDatabase~com.netymon/addressBook
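The same assembly can be sketched in code. The helper names below (`invert_domain`, `model_uri`, `model_url`) are illustrative assumptions rather than anything in mulgara; the inputs are the four values from the example above.

```python
def invert_domain(domain: str) -> str:
    """Reverse the dotted labels, e.g. 'netymon.com' -> 'com.netymon'."""
    return ".".join(reversed(domain.split(".")))


def model_uri(organisation: str, server_id: str, model_name: str) -> str:
    """Host-independent name: rdfdb://server-id~organisation-id/modelName."""
    return f"rdfdb://{server_id}~{invert_domain(organisation)}/{model_name}"


def model_url(protocol: str, host: str, server_id: str, model_name: str) -> str:
    """Existing location form: protocol://host/server-id#modelName."""
    return f"{protocol}://{host}/{server_id}#{model_name}"


print(model_uri("netymon.com", "rdfDatabase", "addressBook"))
print(model_url("rmi", "pneuma.netymon.com", "rdfDatabase", "addressBook"))
```

Note that only `model_url` takes a host argument, which is the point of the proposal: moving the server to another machine changes the URL but leaves the URI alone.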
