Etymon

Wednesday, November 25, 2009

JRuby and JNI and OS X.

Like practically everyone, I sat down a few years back and picked up enough Ruby to write an address-book app in Rails. Nice enough language, with a few nice features[0]. Still, at the end of the day... YADTL, so ultimately YAWN[1].

All this is to say that I found myself yesterday facing the need to convert a trivial Rails app into a Java servlet, with the added complication that it uses a third-party C++ library to do the bulk of the work.

JRuby is a trivial install, and the application in question didn't make use of continuations, so rip out all the Rails code, replace it with a simple loop that takes requests from the command line instead of a POST parameter, and we're in business.

Well, up to the point where it falls over because it needs the C++ library. Here's the first real hurdle: Apple's JDK for OS X is compiled as a 64-bit binary, which can't link against 32-bit libraries[2], and by default the C compiler only produces 32-bit libraries.

$ file /usr/local/lib/libcrfpp.dylib
/usr/local/lib/libcrfpp.dylib: Mach-O dynamically linked shared library i386

A quick check on Google yields these pages, which suggest that I need to add -arch i386 -arch x86_64 to the CXXFLAGS and LDFLAGS. So after modifying the Makefile to compile the library with the above flags and install it in /opt/local we have:

$ file /opt/local/lib/libcrfpp.dylib
/opt/local/lib/libcrfpp.dylib: Mach-O universal binary with 2 architectures
/opt/local/lib/libcrfpp.dylib (for architecture i386): Mach-O dynamically linked shared library i386
/opt/local/lib/libcrfpp.dylib (for architecture x86_64): Mach-O 64-bit dynamically linked shared library x86_64

Then we need to compile the SWIG-generated JNI wrapper, which requires the above parameters and also needs to link against Apple's JNI installation. Back to Google, which provides this page from Apple describing how to compile JNI on Darwin; it supplies the include path and an additional link flag.

The command lines to compile the wrapper in question become:

g++ -arch i386 -arch x86_64 -c -I/System/Library/Frameworks/JavaVM.framework/Headers CRFPP_wrap.cxx
g++ -arch i386 -arch x86_64 -dynamiclib -L/opt/local/lib -lcrfpp -o libCRFPP.jnilib CRFPP_wrap.o -framework JavaVM

Wanting to avoid calling System.loadLibrary from Ruby code, I wrote a quick wrapper class that could make the call in a static block:

package crfppwrapper;

import org.chasen.crfpp.Tagger;

public class CRFPP {
    public static Tagger newTagger(String args) {
        return new Tagger(args);
    }

    static {
        System.loadLibrary("CRFPP");
    }
}

Then modify the relevant Ruby file to link to the jar file:

require 'java'
require 'lib/CRFPP-0.53.jar'

and call the wrapper:

crfppwrapper.CRFPP.newTagger("...");

Finally, modify the toplevel JRuby script to pass the necessary Java parameters to the JVM:

#!/usr/bin/env jruby -J-Djava.library.path=/opt/local/lib -J-cp .:lib/CRFPP-0.53.jar -w

And it worked. No more Rails; migrated to jruby; and ready to bundle into a trivial servlet that will reintroduce the POST parameter to the mix.
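
When the load step does not work, the usual symptom is an UnsatisfiedLinkError, and the first things worth checking are the JVM's architecture and whether the library file is actually on java.library.path. The following is a minimal diagnostic sketch, not part of the original setup: the class name LibraryPathCheck and its locate helper are hypothetical, and the file name the JVM expects (printed by System.mapLibraryName) may differ from the .jnilib name used above depending on the JVM in use.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper: reports where (if anywhere) a native library would be found.
class LibraryPathCheck {

    // Returns the first regular file on searchPath whose name matches the
    // platform-specific file name the running JVM derives from libName.
    static Optional<Path> locate(String libName, String searchPath) {
        String fileName = System.mapLibraryName(libName);
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty entries such as a trailing separator
            }
            Path candidate = Paths.get(dir, fileName);
            if (Files.isRegularFile(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String lib = args.length > 0 ? args[0] : "CRFPP";
        String searchPath = System.getProperty("java.library.path", "");
        System.out.println("os.arch            = " + System.getProperty("os.arch"));
        System.out.println("java.library.path  = " + searchPath);
        System.out.println("expected file name = " + System.mapLibraryName(lib));
        System.out.println("found              = " + locate(lib, searchPath));
    }
}
```

Run it with the same -Djava.library.path setting as the real script; an empty "found" value means the JVM would not find the wrapper there either. It says nothing about whether the file contains the right architecture, which is what the file command above is for.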

[0] The implicit closure parameter to every function and an elegant brace syntax for creating the closures themselves do lend themselves to some wonderfully elegant idioms.

[1] Of course if you were coming from J2EE you were probably more than happy to exchange static-typing for configuration-by-convention.

[2] And all of a sudden I'm having flashbacks to the horror of IRIX binary toolchain management in the late 90's.

Tuesday, September 08, 2009

ldd equivalent under Darwin (OS X)

ldd is an incredibly useful command under Linux. It lists the shared libraries required by an executable, and exactly which .so file each dependency is currently resolved to.

A similarly useful tool is objdump, which permits inspection of object files, especially executables and shared libraries, extracting things like the string-table, symbol-table, and various headers and sections.

Under Darwin these tools are combined in otool. ldd can be duplicated by otool -L.

Wednesday, August 26, 2009

Gov2.0 and Open-Source - a response

Cross-posted from Data.gov and lessons from the open-source world at the Gov2.0 Taskforce blog.

My biggest worry is that the government's response to this initiative will be the announcement of some $multi-million grand-gesture. Big press-conference; Minister announcing the 'grand vision'; and the possible benefits we could see, lost in the maze that is large-scale government procurement.

The key insight of CatB (The Cathedral and the Bazaar) is the extent to which redundancy is a benefit in exploratory development.

For the moment, we have no idea of the correct model for Gov2.0 - we have some understanding of what has worked outside of government, and a few promising avenues of approach, but no actual answers.

So I think we want to recommend that different Agencies experiment with different approaches and that the OIC be tasked with:

  1. Examining the success/failure of the different attempts, and eventually starting to help agencies improve the success rate.
  2. Ensuring that the legal and regulatory requirements for aggregation and interoperability of/between these different services are standardised, as these are the issues that will derail bazaar development.
  3. Acting as a central clearing house where agencies/organisations/individuals can choose to self-publish interface descriptions, custom schema, metadata element schemes, vocabularies etc.
  4. Providing a collaboration and mediation service to allow the reconciliation of conflicting interfaces/schemas/schemes/vocabs.

The result would hopefully be a myriad of exploratory projects, some of which would fail, most of which would be ho-hum, but many of which would succeed.

The OIC would act as an institutional memory, learning and recording the lessons learnt; an institutional coordinator, making sure that people wanting to aggregate/integrate the different data-sources aren't forbidden from doing so; and an institutional mediator, assisting the different projects in finding each other and working together when they would like to.

Please post any comments to the gov2.0 site listed above, not here.

Monday, July 13, 2009

Google’s new OS could hit Microsoft where it hurts

In a blog post, Andy Goldberg provides a quote that I suspect a lot of Microsoft sycophants will be repeating to themselves over the next year:

"Google may or may not have the experience and capability of actually producing an operating system and getting it deployed," he said. "It may not realise how hard it is."

Anyone who takes this line should remind themselves: Chrome OS is not an operating system! - it is a browser-based windowing system running on top of an open-source OS; and Google can definitely handle that. Moreover, Google has spent the better part of the past decade doing OS design and implementation. It's just that they haven't been selling it; they have been using it internally.

Friday, July 10, 2009

How to Write an Equality Method in Java

Martin Odersky, Lex Spoon, and Bill Venners have written an article on How to Write an Equality Method in Java. Given the amount of traffic my post on the same topic has attracted, I thought I might take a look. Their solution is the best I have seen, and well worth a look, but the article suffers the same problem as every article I've read on this topic: it fails to properly define equals().

A Java equals() method should implement a test for bisimulation.

To me, the remarkable thing about the Artima article is that it provides an example where the traditional getClass() approach I have previously recommended fails to implement bisimulation. The good news is that this is a rather artificial use-case, and as such isn't going to be causing many bugs. The bad news is that the fix requires a helper method defined at the topmost concrete class, and the need for it isn't intuitive.

Anyway, if you program in Java you need to read and understand this article.
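
For readers who want the shape of the fix before reading the article, here is a minimal sketch of the helper-method approach, using the name canEqual as I understand the article to do. Point and ColoredPoint are illustrative classes, and this is a condensed reading of the technique rather than the article's exact code.

```java
import java.util.Objects;

// Topmost concrete class: defines the helper that subclasses may override.
class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Helper: "am I willing to be considered equal to 'other'?"
    public boolean canEqual(Object other) {
        return other instanceof Point;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Point)) return false;
        Point that = (Point) other;
        // Ask the other object too, so the relation stays symmetric.
        return that.canEqual(this) && x == that.x && y == that.y;
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}

// A subclass that adds state refuses to equal plain Points.
class ColoredPoint extends Point {
    private final String color;

    ColoredPoint(int x, int y, String color) {
        super(x, y);
        this.color = color;
    }

    @Override
    public boolean canEqual(Object other) {
        return other instanceof ColoredPoint;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof ColoredPoint)) return false;
        ColoredPoint that = (ColoredPoint) other;
        return that.canEqual(this) && color.equals(that.color) && super.equals(that);
    }

    @Override
    public int hashCode() {
        return Objects.hash(super.hashCode(), color);
    }
}

class EqualityDemo {
    public static void main(String[] args) {
        Point p = new Point(1, 2);
        Point anon = new Point(1, 2) { };  // anonymous subclass, no new state
        ColoredPoint red = new ColoredPoint(1, 2, "red");

        // A getClass()-based equals would report false here.
        System.out.println(p.equals(anon) && anon.equals(p));  // prints true
        // Neither direction treats a ColoredPoint as equal to a plain Point.
        System.out.println(p.equals(red) || red.equals(p));    // prints false
    }
}
```

The anonymous subclass is the artificial case: it adds no state, so it is observably the same as a plain Point, yet its class differs. Only subclasses that add state need to override canEqual.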

Thursday, July 02, 2009

Scala and Various Things.

Not quite ready to move across yet - however capturing some links to make life that little bit easier when I do. I'm hoping these approaches will work as well for Elmo/OTM as they do for Hibernate.

I'll be updating this with any more links I want to save for a sunny day.

A couple more links worthy of rereading:

Wednesday, June 17, 2009

Looking for JSON, ReST, (and in-memory RDF) frameworks

Currently writing a number of small web-services to do various informatics tasks (more detailed post to come). Fortunately I'm not the one having to deal with 3rd-party SOAP APIs! Still I do need to do various XML and JSON parsing, and not having addressed the latter before I've gone looking for libraries.

Currently I am about to start using Jackson, but was wondering if anyone had any warnings, advice, or recommended alternatives? In the course of looking at what was out there I have also come across Restlet, a ReST framework that seems well worth the time to figure out and deploy, so I will probably be doing that soon as well. Any warnings or advice on this would be welcome.

One of the nice things about Restlet is its support for RDF. Granted it doesn't support querying, and the terminology in the interface is a bit confused, but it does use its native Resource interface for URIRefs, so it should integrate well. OTOH, if it does prove useful as a ReST framework, I can see myself writing a quick Sesame or Mulgara extension, as there is only so much you can do with RDF before you need a query and/or data-binding interface.

Thursday, June 04, 2009

Deploying to tomcat using maven

This is a note to capture the process I am currently using to build and deploy a basic web application to tomcat for development.

  1. Create simple web app using archetype.
    mvn archetype:create ... -DarchetypeArtifactId=maven-archetype-webapp
  2. Add server to ~/.m2/settings.xml
      <servers>
        <server>
          <id>local-tomcat</id>
          <username>username</username>
          <password>password</password>
        </server>
      </servers>
    
  3. Add the plugin, referencing that server, to the toplevel pom.xml
      <build>
        ...
        <plugins>
          <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>tomcat-maven-plugin</artifactId>
            <version>1.0-beta-1</version>
            <configuration>
              <server>local-tomcat</server>
            </configuration>
          </plugin>
        </plugins>
      </build>
    
  4. Compile, test, deploy, and subsequently redeploy
      $ mvn compile
      $ mvn test
      $ mvn tomcat:deploy
      $ mvn tomcat:redeploy
    

Of course what I would dearly love to know is how you configure a server url in the settings.xml file. The documentation I can find (http://mojo.codehaus.org/tomcat-maven-plugin/configuration.html) describes how to use a custom url, and how to configure authentication information. What it doesn't do is explain how you can do both at the same time, and all my attempts have resulted in XML validation errors when trying to run maven --- if I figure it out I'll update this post.

Wednesday, May 27, 2009

ANTLR: an exercise in pain.

Ok, so for my current project I need to build either a heuristic or a machine-learning based fuzzy parser. As someone who has written numerous standard parsers before, this qualifies as interesting. Current approaches I'm considering include cascading concrete grammars; stochastic context-free grammars; and various forms of hidden-Markov-model based recognisers. Whatever approach ends up working best, the first stage for all is a scanner.

So I start building a jflex lexer, and make reasonable progress when I find out that we are already using ANTLR for other projects, so I should probably use it as well. Having experienced mulgara's peak of 5 different parser generators - this does eventually become ridiculous - I was more than willing to use ANTLR. Yes it's LL, and yes I have previously discussed my preference for LR; but, it does support attaching semantic actions to productions, so my primary requirement of a parser-generator is met; and anyway, it has an excellent reputation, and a substantial active community.

What I am now stunned by, is just how bad the documentation can be for such a popular tool. One almost non-existent wiki; a FAQ; and a woefully incomplete doxygen dump, does not substitute for a reference. ANTLR has worse documentation than sablecc had when it consisted of an appendix to a masters thesis!

My conclusion: If you have any choice don't use ANTLR. For Java: if you must use LL I currently recommend JavaCC; if you can use an LALR parser do so, my current preference is Beaver.

Monday, May 19, 2008

So why can't google provide html for 40% of pdfs?

A google search for opensource filetype:pdf returns the standard 10 results on the front page, but only 6 of them offer a "View as HTML" link. Is it just me, or has this become more prevalent recently? And what is the common property that results in this behaviour?

If anyone has any clues or ideas I would love to hear them.

Tuesday, January 15, 2008

Now this is how conference videos should be presented

Parleys is a great collection of conference presentations and interviews (mostly on Java topics), but of particular interest to me is the presentation, the best I have seen.

Thursday, November 29, 2007

How to start thinking about RDF.

While trying to understand RDF, its capabilities and its limitations I have done a lot of reading of earlier work on semi-structured data management. It occurs to me that I haven't blogged a paragraph from a 1995 paper on heterogeneous information sources that really crystallised for me the crucial difference between RDF and previous attempts at semi-structured data (especially XML and friends) - better late than never I suppose...

We need not define in advance the structure of an object ... no notion of a fixed schema or object class. A label can play two roles: identifying an object (component) and identifying the meaning of an object (component). If an information source exports objects with a particular label, then we assume that the source can answer the question "What does this label mean?". It is particularly important to note that labels are relative to the source that exports them. That is, we do not expect labels to be drawn from an ontology shared by all information sources.

Papakonstantinou, Y. et al. "Object Exchange Across Heterogeneous Information Sources", 11th Conference on Data Engineering, IEEE Computer Society, 251-260, 1995.

The actual framework presented in the paper is focused on intraorganisational integration and so doesn't have the property of anarchic scalability required if it is to be applied to the internet - however it does express clearly this need for source-defined semantics if semi-structured data is to be tractable.

Wednesday, October 17, 2007

Implementing a Resource Manager for the Java Transaction API.

One would hope that when an organisation like Sun specifies a standard as important as a transaction API they might take the time to make a reasonable effort -- unfortunately you would be wrong. The JTA specification is freely available from Sun, however the document is extremely vague and requires the reader to infer the underlying transaction state machine - with the unsurprising result that the corner cases are left completely unspecified. So having waded through this morass myself, I include here my advice on what you need to do to understand JTA, in the hope that anyone else coming to this standard won't have to waste as much time as I have understanding it.

Read the JTA spec last. JTA is little more than a thin wrapper around the Object Transaction Service 1.1 published by the OMG. This is one of the CORBA standards published in the late 90's and early 2000's - and like most of the early CORBA specs, it is well written; easy to read; and pretty complete. Unfortunately it too leaves the underlying state-machine and some of the corner-cases (especially those to do with recovery) underspecified. As a result I recommend printing this out and referring to it as an adjunct to the JTA spec.

Fortunately the CORBA-OTS is itself an object-oriented wrapper around another spec, Distributed Transaction Processing: The XA Specification, published by X/Open (now The Open Group). There in Chapter 2 you will find most of the definitions missing from the other two specs; and in Chapter 6 the state-tables that provide definitive semantics for the various operations you will need to implement. You will also find a reference to another related X/Open specification - Distributed Transaction Processing: Reference Model - which contains all the remaining definitions and assumptions missing from the JTA/OTS specs.

So if you do need to implement a JTA interface I strongly recommend you start with the X/Open reference model; then read the X/Open XA Spec; and only then read the JTA specification alongside the OTS spec for elaboration.
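
To make "infer the underlying state machine" a little more concrete, here is a deliberately simplified sketch of the kind of state tracking a resource manager ends up doing for a single transaction branch. The state names, the BranchTracker class, and the set of legal transitions are an illustrative simplification, not taken from any of the specs: suspend/resume, join, read-only votes, heuristic outcomes, and recovery are all left out, and the state tables in Chapter 6 of the XA spec remain the authority.

```java
// Simplified branch states; the names are illustrative shorthand.
enum BranchState { NON_EXISTENT, ACTIVE, IDLE, PREPARED }

// Tracks one transaction branch and rejects calls that arrive out of order.
// Method names mirror the XAResource operations, but take no Xid or flags.
class BranchTracker {
    private BranchState state = BranchState.NON_EXISTENT;

    BranchState state() {
        return state;
    }

    void start() {
        move("start", BranchState.ACTIVE, BranchState.NON_EXISTENT);
    }

    void end() {
        move("end", BranchState.IDLE, BranchState.ACTIVE);
    }

    void prepare() {
        move("prepare", BranchState.PREPARED, BranchState.IDLE);
    }

    void commit(boolean onePhase) {
        // One-phase commit skips prepare, so it is accepted only from IDLE;
        // two-phase commit is accepted only after prepare.
        BranchState required = onePhase ? BranchState.IDLE : BranchState.PREPARED;
        move("commit", BranchState.NON_EXISTENT, required);
    }

    void rollback() {
        // Assumption of this sketch: rollback is accepted only once work
        // on the branch has ended.
        move("rollback", BranchState.NON_EXISTENT, BranchState.IDLE, BranchState.PREPARED);
    }

    private void move(String op, BranchState next, BranchState... legalFrom) {
        for (BranchState s : legalFrom) {
            if (s == state) {
                state = next;
                return;
            }
        }
        // A real XAResource would report a protocol error via XAException
        // instead of IllegalStateException.
        throw new IllegalStateException(op + " is not legal in state " + state);
    }
}

class BranchTrackerDemo {
    public static void main(String[] args) {
        BranchTracker branch = new BranchTracker();
        branch.start();
        branch.end();
        branch.prepare();
        branch.commit(false);
        System.out.println(branch.state());  // prints NON_EXISTENT

        BranchTracker broken = new BranchTracker();
        broken.start();
        try {
            broken.commit(false);  // no end/prepare yet
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());  // prints commit is not legal in state ACTIVE
        }
    }
}
```

Writing the table down explicitly like this, even in a crude form, is what exposes the corner cases: every empty cell is a question the JTA document alone will not answer.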

Wednesday, September 19, 2007

Model-URI/URL Use-cases and Requirements and Proposal

Just posted this to mulgara-general - posting here to provide readily accessible permanent reference. I would greatly appreciate any comments anyone may have - please also feel free to solicit comments from outside the mulgara community if there is interest.

The Use Cases and Requirements
The three key requirements of a model-URI proposal are:

1. Protocol/Scheme independence
2. Model/Server mobility
3. URI-standards compliance (ie. no fragment)

Also desirable are

4. Unique-name
5. Namespaced to allow a) potential resolution; b) predictable, human-readable URI's.

The context of the most complex use-case involves 4 models and 4 machines (and assumes a Distributed or Federated Resolver)

:modelA is on server1 on host1 and needs 
     to reference :modelB and :modelC
:modelB is on server2 on host2
:modelC is on server3 on host3
:modelD is on server4 on host4 run by an unrelated organisation

The application needs to perform the query:

select $id subquery(
  select $s $p $o 
  where $s $p $o in $locn and 
        $id <mulgara:locatedAt> $locn in <mulgara:modelURLResolver>)
from host1:modelA
where [ <:useModel> $id ] ;

This queries each model listed in :modelA after converting its identifier into a URL via a posited resolution mechanism.

Now host2 fails, and we restore server2 on host3 to run alongside server3.

We would like to be able to have the query run unmodified.

What this means is that :modelB cannot encode host2 in its URI.

The URI does need to encode some sort of server-id as servers are guaranteed to use the same model-names at least some of the time (consider that all system-models have the name "").

Also because :modelD and :modelA-C are managed by unrelated organisations we must somehow encode the organisation in the model's URI-stem as they may well decide to use the same server-id ("server1" or "database" anyone?).

Also consider that any encoding of the organisation must also allow that organisation to maintain their own independent registry, or the proposal ceases to be scale-free (it's on this that the original UUID proposal foundered).

I have considered abandoning requirement 4, and just using URL's. However ultimately we require a canonical name for internal purposes (even if it isn't exposed externally), and so even using URL's we would have to pick a designated 'unique name' for the model - we can't escape that - so we might as well save ourselves the headache and make it unambiguous.

So a summary of my thinking on the use-cases/requirements for rdf model-names - we desire:

1. Unambiguously an identifier
2. Encodes organisation
3. Encodes server-id
4. Doesn't encode hostname
5. Potentially resolvable via a per-organisation registry

The Proposal

If we wish to be unambiguous then we should use our own URI-scheme. This has the added benefit that once we use our own scheme we have a lot more flexibility regarding how we structure the rest of the URI to meet our requirements.

I am proposing to use the scheme 'rdfdb' - as did the original UUID proposal.

I would prefer to avoid the use of opaque URI's; there is no reason why our URI can't be introspected if we structure it sanely - so the structure according to RFC2396 will be 'rdfdb://authority/path'.

Logically the model-name itself makes a good path so we arrive at 'rdfdb://authority/modelName'. Leaving the need to encode an organisation and a server-id in the authority in a fashion that will potentially permit resolution via a registry.

Now as the authority is not a hostname, RFC2396 identifies us as a "Registry-based Naming Authority". As such, the characters we have permitted to us are [ - _ . ! ~ * ' ( ) $ , ; : @ & = + ] (excluding the []'s) - and the characters reserved are [ ? / ].

I therefore propose to structure the authority 'server-id~organisation-id' (that is the server-id and org-id separated by a tilde).

At the moment we don't support hierarchical server-id's; but I would like to leave us the option of doing so once we start supporting more aggressive distribution. We also need to consider that it needs to remain a valid path-element for use in our existing model-URL's. So for now I would like to limit server-id to what we currently use, but ultimately I think we should consider some sort of delimited hierarchical form (probably dotted).

The organisation-id should be something that will eventually permit the identification of a registry. For now a dotted hierarchical form should suffice - although I will make sure the implementation leaves this as open as possible (the use of a tilde makes this possible).

It has also been suggested that to make it unambiguously clear we are *not* encoding a hostname as the organisation-id we should invert the traditional dns-style representation.

So putting all the pieces together: If I am running a mulgara server -

host:         pneuma.netymon.com
organisation: netymon.com
server-id:    rdfDatabase
model-name:   addressBook

The model URL for addressBook remains: rmi://pneuma.netymon.com/rdfDatabase#addressBook
or: soap://pneuma.netymon.com/rdfDatabase#addressBook ...etc...

and the model URI for the model is: rdfdb://rdfDatabase~com.netymon/addressBook
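
As a quick sanity check that the proposed form survives a standard URI parser, here is a minimal sketch that builds and takes apart such a name with java.net.URI. The ModelName class and its method names are hypothetical, the organisation-id is inverted as suggested above, and no validation of the server-id or model-name characters is attempted.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for names of the form rdfdb://server-id~organisation-id/modelName
class ModelName {
    static final String SCHEME = "rdfdb";

    // Reverses a dotted name: "netymon.com" becomes "com.netymon".
    static String invert(String dottedName) {
        List<String> parts = new ArrayList<>(Arrays.asList(dottedName.split("\\.")));
        Collections.reverse(parts);
        return String.join(".", parts);
    }

    // Builds the model URI from a server-id, a DNS-style organisation, and a model name.
    static URI build(String serverId, String organisation, String modelName) {
        String authority = serverId + "~" + invert(organisation);
        try {
            return new URI(SCHEME, authority, "/" + modelName, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("not a valid model name", e);
        }
    }

    // Splits a model URI into { server-id, organisation-id, model-name }.
    static String[] parse(URI uri) {
        if (!SCHEME.equals(uri.getScheme())) {
            throw new IllegalArgumentException("not an rdfdb URI: " + uri);
        }
        // The tilde is not legal in a hostname, so the authority is treated
        // as registry-based and is only available via getAuthority().
        String authority = uri.getAuthority();
        int tilde = (authority == null) ? -1 : authority.indexOf('~');
        String path = uri.getPath();
        if (tilde < 0 || path == null || path.length() < 2) {
            throw new IllegalArgumentException("expected server-id~organisation-id/modelName: " + uri);
        }
        return new String[] {
            authority.substring(0, tilde),
            authority.substring(tilde + 1),
            path.substring(1)
        };
    }

    public static void main(String[] args) {
        URI uri = build("rdfDatabase", "netymon.com", "addressBook");
        System.out.println(uri);  // prints rdfdb://rdfDatabase~com.netymon/addressBook
        System.out.println(Arrays.toString(parse(uri)));
        // prints [rdfDatabase, com.netymon, addressBook]
    }
}
```

Splitting on the first tilde keeps the organisation-id opaque to the parser, which leaves room for whatever registry scheme is eventually chosen.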