" /> Bill de hÓra: August 2003 Archives


August 31, 2003

Testing with a Clock

I needed an object called BuildTime* that holds timing information about automated builds. Every Build being built will need to know what time the next build is at, what the build interval is, and so on. Testing this was looking tricky, since it started to involve:

  • fooling around with Date, Calendar objects,
  • taking checkpoints against the System.currentTimeMillis call and trying to reason backwards from them,
  • sleep()ing while the clock moved forward,
  • examining printouts of Dates.

The last two did it for me. There's no need for real time or high-precision clocks when it comes to an automated build. A granularity of seconds or even minutes is fine, yet there's no way I'm putting the tests to sleep() and waiting for them to finish. And we all know that reading printouts defeats the purpose of unit testing.

So eventually I threw a Clock class together and allowed the BuildTime object to take a Clock object as a constructor argument. The Clock object itself is very simple, here's the interface:

 
import java.util.Date;

public interface Clock
{
  void synchronize();
  void wind( long much );
  void unwind( long much );
  void unwindToTheEpoch();
  long currentTimeMillis();
  long currentTimeSeconds();
  Date currentDate();
  Date getTodayZeroHundred();
}

Clock uses System.currentTimeMillis as a millisecond counter for application time rather than as the system time - the Clock determines the time for its client via an offset, which is set using wind(), unwind() and synchronize().
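For what it's worth, here's a sketch of what an implementation might look like. It's only an illustration of the offset idea - the class name and details are mine, only the Clock interface above comes from the real code:

import java.util.Calendar;
import java.util.Date;

// A sketch only: an offset-based Clock. Application time is system time
// plus an offset that the tests move around.
public class OffsetClock implements Clock
{
  private long offset = 0L;

  public void synchronize() { offset = 0L; }                // back to system time
  public void wind( long much ) { offset += much; }         // application time jumps forward
  public void unwind( long much ) { offset -= much; }       // application time jumps back
  public void unwindToTheEpoch() { offset = -System.currentTimeMillis(); }
  public long currentTimeMillis() { return System.currentTimeMillis() + offset; }
  public long currentTimeSeconds() { return currentTimeMillis() / 1000L; }
  public Date currentDate() { return new Date( currentTimeMillis() ); }

  public Date getTodayZeroHundred()
  {
    Calendar cal = Calendar.getInstance();
    cal.setTime( currentDate() );
    cal.set( Calendar.HOUR_OF_DAY, 0 );
    cal.set( Calendar.MINUTE, 0 );
    cal.set( Calendar.SECOND, 0 );
    cal.set( Calendar.MILLISECOND, 0 );
    return cal.getTime();
  }
}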

In the time I've been using it, I've found the Clock valuable for playing time travel with the code. Hourly or daily build cycles can be tested by making a single method call to change the time, without having to wait for that length of time to pass. It's also good for checkpointing against a known time and testing against intervals. For example, the getTodayZeroHundred() call returns a Date set to zero hundred hours today, which can be used to set the Clock to that time.
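To give a flavour of what the tests look like, here's a sketch against the offset implementation above, not the real test code. A BuildTime built with the same Clock would see the jumps too, but since its API is assumed here it only appears in a comment:

import junit.framework.TestCase;

// Sketch only: tests jump application time around by winding the Clock
// instead of sleep()ing. A BuildTime constructed with the same Clock
// would see the jump as well.
public class ClockTest extends TestCase
{
  private static final long ONE_HOUR = 60 * 60 * 1000L;

  public void testWindMovesApplicationTimeForwardAnHour()
  {
    Clock clock = new OffsetClock();
    long before = clock.currentTimeMillis();
    clock.wind( ONE_HOUR );   // time travel: no waiting for a real hour to pass
    assertTrue( "application time should have moved on by at least an hour",
                clock.currentTimeMillis() - before >= ONE_HOUR );
  }

  public void testUnwindMovesApplicationTimeBack()
  {
    Clock clock = new OffsetClock();
    long before = clock.currentTimeMillis();
    clock.unwind( ONE_HOUR );
    assertTrue( "application time should be in the past",
                clock.currentTimeMillis() < before );
  }
}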

I'm starting to think that this may be a more generally useful idiom for testing asynchronous code - a lot of what I do in work involves designing and building messaging systems. These are not always easy to test automatically unless you're willing to hang around for things to kick off - notably for reliability and retry mechanisms.

The secret to testing asynchronous code seems to be making application time logical instead of physical. I think perhaps the reason testing asynchronous code is considered hard is that time is controlled by the operating system and not the software application, so the application developer has to hang around for the OS to catch up.

I can't imagine I'm the only person manipulating time in this way for testing - and if it's useful, I'm thinking someone else has probably written a 'real' Clock. So, does anyone else have this idiom?


* By the way, if you're wondering why I don't use Cruisecontrol or Anthill, that's easy. I did. I found them sufficiently complicated, fidgety to set up, overly abstracted away from Ant, and insufficiently abstracted away from reporting, that I decided to go ahead and write my own. I want to see how simple the automated build process can be once the decision not to use at or cron is made. My guess was that all you needed to stunt the framework was a build config file, a couple of daemons, timing logic, and a plugin hook for reporting. And that seems to be the way things are working out.

Testing, robustness and checked exceptions

Ted Leung contrasts checked exceptions with testing:

If your testcases actually tested the relevant error conditions, then you wouldn't need to have a language mechanism to ensure that the errors got handled. At least, you could make all exceptions unchecked, and eliminate the multi-arm catch hairballs that are lying around.

He also succinctly captures an aspect of checked exceptions I just couldn't put my finger on:

It's not like all those catch blocks that are empty, printing stacktraces, or logging exceptions are really improving the robustness of the systems that they are in.

To my mind, with all those try/catch blocks, there's a lot of noise going into the code for little benefit.

August 30, 2003

Fatboy stick

Via Charles Miller: Fatboy Slim - Weapon Of Choice.

Checked exceptions still considered pointless

I can't agree with Ross Judson, though I understand where he's coming from. Anders Hejlsberg is not wrong on checked exceptions, though there are other reasons to avoid them.

First, most libraries will declare a general exception class, like LibraryException, and build all their more specific exceptions using subclassing. Most methods will throw a LibraryException, and specialize from there if necessary. Problem solved. You can pay attention to the messages and concrete subclasses if you want to, but you don't have to.

One answer to all this fuss is to throw LibraryException everywhere if all you're going to do with the internal exceptions is cast them into string/stream objects and be done with it. But, and it's a big but, tying exceptions to packages/components or a service boundary is bad design. Don't tell me something bad happened in the Library, I already know that something bad happened in the Library. Describe what happened in the Library. There's no interest or use for a LibraryException, though there may well be interest and use in documenting the unexpected behaviour in the Library. It's a bit like the comment that tells me what the next line of code does instead of why it does it, an approach we've known for years and years to be silly. Exceptions that say where they come from instead of why they arose are no different to those comments (and if you can replace the useful comment with a useful method name, more power to you).

Exceptions are about understanding behaviour, not structure. After more than 6 years of dealing with other Java folk's checked exceptions, I've concluded the only exceptions that should be thrown at package/namespace boundaries are ones that signify a behavioural issue, and preferably ones that are already supplied by the language's core libraries (assuming you had decently designed core libraries to begin with, which is not entirely the case with Java). I'm also coming to the conclusion that clients should have the option to not deal with exceptions and reason against the main line if they wish, rather than be forced to handle a checked exception. Which is an argument for dropping checked exceptions altogether.

Second, chained exceptions are an effective way of dealing with the library combination issue. You capture a sub-library's exception, then wrap it with one of your own. As a library, your use of a sub-library is something that should be transparent to the calling system. Chained exceptions allow you to convey the right information back in a general form.
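In code, that's the familiar Java 1.4 chaining idiom. The names below are illustrative only - LibraryException isn't from any real library, and java.io stands in for the sub-library:

import java.io.FileReader;
import java.io.IOException;

// Illustrative only: wrapping a sub-library's checked exception in our own,
// using the Throwable cause constructors added in Java 1.4.
class LibraryException extends Exception
{
  LibraryException( String message, Throwable cause ) { super( message, cause ); }
}

class Library
{
  public FileReader open( String path ) throws LibraryException
  {
    try
    {
      // java.io plays the part of the sub-library here; its checked
      // IOException is caught so it never appears in our signature
      return new FileReader( path );
    }
    catch( IOException e )
    {
      throw new LibraryException( "could not open: " + path, e );
    }
  }
}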

Chained exceptions are hacking at symptoms instead of solving problems. I should probably explain that. Chained exceptions exist solely to stop dependencies on someone else's checked exceptions bleeding into the dependent's code signatures. This is because (again) exceptions deal with behaviour not structure, and as such are intimately tied to control flow, not code organization. Without some such trickery, somebody else's exception type will invariably pollute your code, something which is for the most part unnecessary. Checked exceptions are in my humble opinion a mess when it comes to decoupling code. Sort of like hardcoding your debugger's checkpoints into your library and expecting the world to be ok with that. I imagine tests are a better way to weed out such issues than punting with exceptions.

Almost everything we do with a checked exception is log it and read it, i.e., we cast its message to a string or stream and send it to i/o. We can do that just as well with an informative message in a runtime exception or a core checked exception. In my eyes the special case is almost always not a good candidate for creating a new type. Are there special cases that should be identified with a type? No doubt, but I think they're rarer than we imagine. There are only so many things that can go wrong with a computer program and we know about a good number of them already - any further refinement can be dealt with through an exception's message.
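To make that concrete, something like this is usually all that's needed - BuildConfig and its message are made up for illustration:

// Sketch with made-up names: the detail lives in the message of an existing
// unchecked type, instead of in a new checked BuildConfigException that
// every caller would be forced to catch.
public class BuildConfig
{
  private final long intervalMillis;

  public BuildConfig( long intervalMillis )
  {
    if( intervalMillis <= 0 )
    {
      throw new IllegalArgumentException(
        "build interval must be a positive number of milliseconds, got: "
        + intervalMillis );
    }
    this.intervalMillis = intervalMillis;
  }

  public long getIntervalMillis()
  {
    return intervalMillis;
  }
}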

The other thing I can do with a checked exception is to move off the main line of control flow, and go execute something else. When I find myself using exceptions to outline alternative paths of execution, in other words when I want to catch an exception and do something other than log it and continue (or, really, I want my caller to do it), I need to stop and admit that I'm coding with GOTOs. What's happening is that I'm using exceptions to intimate that dealing with failure cases and programming away from the main line is an important part of programming against my library. What I really ought to do is reveal that intent through an explicit protocol that my callers can act upon. Handling exceptions and being clever gets me into trouble more often than not.

Languages that don't check them (which is basically everything else out there) just kind of scare me right now. How do you know you've got everything handled? The answer is...you don't.

They shouldn't scare anyone. As for handling everything - asking to know if we've handled everything is like asking to know about the non-existence of bugs - in short we don't. Sometimes something bad happens in the code and the best thing to do is exit control. If you already know how to deal gracefully with an expected error somewhere in the code (checked exceptions are clearly not unexpected, or exceptional), you can code defensively to that error without a new type.

Not so long ago, Ron Jeffries went head to head with the extremeprogramming list for the best part of a week arguing against exceptions; the list pretty much came up wanting. Checked exceptions don't help much and are best minimized or avoided. But I don't expect to get much traction on that idea. This is how error handling has been done in Java for so long, and it's so idiomatic and reflexive to do things this way, it can be difficult to see what the problem is. Nonetheless, food for thought:

[Alan Griffiths: Exceptional Java]
[Bill de hÓra: Checked exceptions considered pointless]
[C2 Wiki (Robert di Falco): GeneralizeOnExceptionBehavior]
[C2 Wiki (Bill Trost): JavaExceptionsAreParticularlyEvil]

August 27, 2003

OpenOffice to use RELAX NG.

From the Oo xml-dev mailing list:

The OASIS Open Office XML format technical committee has decided to use Relax-NG as "working language". This means that we will use Relax-NG during the specification work, and will create DTD's and W3C XSD schemas based on the Relax-NG schema when the specification is finished. - Michael Brauer

Werner Vogels nails it: Web Services are NOT Distributed Objects.

Werner does a fine job of killing some myths about web services. He's looking for feedback - a must read.

August 17, 2003

Why RDF/XML is the way it is

Web Service Guy

One new feature of the Working Draft that I won't be tackling in a hurry is rdf:parseType = "Collection". Have you seen the triples that this is supposed to produce? And while we are talking syntax, why oh why is RDF / XML syntax so complicated? Any reasons or excuses for this would be appreciated because quite frankly, I just don't understand.
  • The working group charter won't allow syntax surgery. The working group charter didn't allow model surgery either, but it happened.

  • The original syntax was written when XML, and particularly XML Namespaces, was less well understood. For example, they couldn't make up their minds and so allowed three syntactic forms.

  • The original syntax was written against a poorly specified model.

  • Graphs are tricky in XML.

  • RDF Literals are tricky.

  • Tunneling URIs through XML is tricky.

  • Lists and collections are tricky.

  • Until about 6 months ago, not enough XML and web people were affected by RDF/XML - now XML and web people are having to pay attention, but they are two years too late to the party.

  • If you're an RDFer, you value the model above anything else; syntactic warts are incidental.

  • You can offer an alternative syntax if they want. The model is what matters.

  • You can transform it to something else if you want. Syntax is incidental.

  • Now is not a good time to make serious syntactic changes. This has been a long now.

  • People will adapt to almost anything.

I've been going on about this syntax in one form or another for nearly four years. The RDF community and the working group are not listening to syntax pushback - in my direct experience, concerns about the model have always been primary and fundamental belief in RDF is very strong. I hope people with clout, like Sam Ruby and Tim Bray, continue to push back now that the model is nailed down and the syntax spec is close to Recommendation.

August 16, 2003

That is the sound of inevitability: Linux on the desktop

Linux took on Microsoft, and won big in Munich

Welcome to the future. But actually, SuSe took on Microsoft and won big.

I suspect what will really happen is that they'll roll out Linux everywhere, and then every mid-level bureaucrat will realize they can't do their job because some application they need just doesn't run on Linux, and they'll buy Windows XP at full retail price, burying the costs in expense reports or petty cash or somewhere else. [Joel Spolsky]
In the long run, it's just a matter of a Linux distributor getting deadly serious about going after Windows market share. It's bound to happen, after all there's only so many servers out there, and at some point in the next decade, Linux will saturate that market. Linux on the desktop will start inside firewalls and work its way out - some big companies and nations are already asking whether Linux on clients are a better economic option.[Bill de hÓra]

More on Ant's JUnit task

Commenting on my criticism of the JUnit task, Chuck Hinson pointed out that he sets the printsummary attribute to "no". I know this does reduce the verbiage, but it highlights another problem with the task - it swallows the JUnit diagnostic output for a failure.

Here's some failing code using the jar I mentioned:

  test:
.F.....................
Time: 4.026
There was 1 failure:
1) testHome(net.amadan.beats.BeatsTest)
junit.framework.AssertionFailedError: you can see me
at net.amadan.beats.BeatsTest.testHome(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke
(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke
(DelegatingMethodAccessorImpl.java:25)
FAILURES!!!
Tests run: 21, Failures: 1, Errors: 0

Here's the failure case using the JUnit task with printsummary set to "no":

  junit-test:
TEST net.amadan.beats.BeatsTest FAILED
BUILD FAILED

This output is useless. What failed?

Now again maybe there's a flag for this. But the point is that the task is by default swallowing JUnit's useful output, and replacing it with something that is not. This is no better than eating an exception trace, in fact that's exactly what's going on. This also happens with the verbose mode. The irony is that for all its verbiage, the task doesn't actually tell you what you need to know.

Later: Got a suggestion from Mike Clark, who can see traces by tuning the formatter:

  <formatter type="plain" usefile="false" />

Obvious ;) The traces are there now, thanks Mike (should I put a FAQ entry together for using the task like this?). I'm still getting a summary for each Test file though - why can't it default to JUnit output (arrghhh!). Think I'll shut up now and stick to my java hack, which is simple enough for my tiny brain!

August 15, 2003

The Case Against XP: evidence please

Extreme Programming (XP) - An Alternative View

The XP belief is that it's okay to throw away the rulebook and redefine what is possible. [...] Not using XP does not have to mean turning your back on software agility. It is possible to develop software with evolving requirements, with customer involvement and high visibility of progress, in a much more robust fashion than XP prescribes. Similarly, spending a little extra time getting the customer requirements right early on doesn't make the project non-agile.

Very interesting article. I would like to know if there are case studies of projects that had the hard issues all sorted out up front, that captured the requirements, that had the architecture nailed down, and that would allow one to determine and control the cost of major changes - essentially the ones that followed 'the rulebook' as the article suggests, and came out good.

Presumably, if XP can be refuted on the grounds cited in this article, there are such projects (empirical evidence) to back them up. Moreover, what percentage of projects known to follow the rulebook in question do succeed? And it would be important to find out whether the failures were inherently due to the rules or due to poor application of the rulebook.

If XP is barking up the wrong tree, let's see the evidence. XP (and Agile in the large) is based first and foremost on the observation that the rulebook isn't working - but here are some practices that have worked, and here is how to apply those practices. There are a large number of people who claim that XP is in fact working. Again, it is important to determine how many XP projects are failing, and whether that is inherently due to XP or to poor application of XP.

Kent Beck (the inventor of XP) recently suggested that XP is essentially a subset of RUP. Perhaps in the sense that a mouse is a subset of an elephant. In fact RUP and XP have totally different philosophies. But that's a subject for a different article...

This will be a very interesting article. I think the authors will have their work cut out to back up this claim. Rational have produced an Agile subset of the RUP. Grady Booch spent a good part of 2002 on the extremeprogramming list trying to find the intersection between the RUP and XP, before and after the IBM buyout. Robert Martin has produced dX, a subset of the RUP that is effectively XP. It's hard to imagine all these efforts being anything other than incoherent if RUP and XP are so far apart.

Sam Ruby: Inter-net

Sam Ruby

This:

The fulcrum that we need to focus on is the RDF parser. If we can train that to intelligently consume a large subset of XML grammers, critical mass can be achieved.

and this:

Instead of requiring one to make their XML RDF friendly, let's make RDF XML friendly.

Beautiful.

Ant's junit task is too verbose

I've been playing with other people's code. Which involves running other people's unit tests via ant/maven. Which are very very chatty.

Now, I don't use the junit task in Ant for precisely this reason, but I'd forgotten just how chatty the task is. Here's a dump from something I started working on yesterday:

  test:
................... Time: 4.025
OK (19 tests)

Here's an amended dump from picocontainer - there's nothing special about picocontainer, I just happen to have a bash shell open for it:

test:test:
 [junit] Running org.picocontainer.defaults.ComponentSpecification...
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.07 sec
 [junit] Running org.picocontainer.defaults.DefaultComponent...
 [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.06 sec
 [junit] Running org.picocontainer.defaults.DefaultPico...
 [junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 0.21 sec
 [junit] Running org.picocontainer.defaults.Default...
 [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.07 sec
 [junit] Running org.picocontainer.defaults.DefaultPico...
 [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.071 sec
 [junit] Running org.picocontainer.defaults.DefaultPicoContainer...
 [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.06 sec
 [junit] Running org.picocontainer.defaults.DefaultPicoContainer...
 [junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 0.15 sec
 [junit] Running org.picocontainer.defaults.DefaultPicoContainer....
 [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.06 sec
 [junit] Running org.picocontainer.defaults.DefaultPicoContainer....
 [junit] Tests run: 7, Failures: 0, Errors: 0, Time elapsed: 0.151 sec
 [junit] Running org.picocontainer.defaults.DefaultPicoContainerWith...
 [junit] Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.06 sec
 [junit] Running org.picocontainer.defaults.DummiesTestCase
 [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.04 sec
 [junit] Running org.picocontainer.defaults.OldDefault...
 [junit] Tests run: 29, Failures: 0, Errors: 0, Time elapsed: 0.251 sec
 [junit] Running org.picocontainer.defaults.ParameterTestCase
 [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.071 sec
 [junit] Running org.picocontainer.extras.CompositePico...
 [junit] Tests run: 11, Failures: 0, Errors: 0, Time elapsed: 0.08 sec
 [junit] Running org.picocontainer.extras.DefaultLifecycle...
 [junit] Tests run: 8, Failures: 0, Errors: 0, Time elapsed: 0.34 sec
 [junit] Running org.picocontainer.extras.HierarchicalComponent...
 [junit] Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.07 sec
 [junit] Running org.picocontainer.extras.ImplementationHiding...
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.08 sec
 [junit] Running org.picocontainer.NoPicoTestCase
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.04 sec
 [junit] Running org.picocontainer.PicoInvocation...
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.05 sec
 [junit] Running org.picocontainer.PicoPicoTestCase
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.07 sec

Does anyone find this verbiage in any way useful?

Maybe there's a way to turn this off; I haven't found it. But what I don't understand is why the task by default replaces JUnit's nicely minimal output with its own fluff. To me, it looks suspiciously like a malformed report *. True, picocontainer has about 100 tests and I have only 19, but my output won't get any longer, whereas picocontainer's will keep growing with every new test case.

I wrote some code years ago that walks a src/ directory and loads any class that has a marker interface into a suite and executes it. This was so I wouldn't have to keep manually adding test classes to a suite, and to avoid the way master suites force you into bidirectional dependencies for projects with subprojects. Classic programmer laziness. Its jar file is one of the three or four core dependencies (excluding Ant) I have for all projects - the others are junit and jdepend, and sometimes catalina-ant.
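The code itself isn't much more than a directory walk plus Class.forName. Here's a rough sketch of the idea rather than the actual code - the AutoTest marker name and the default directory are made up:

import java.io.File;
import junit.framework.TestCase;
import junit.framework.TestSuite;
import junit.textui.TestRunner;

// Sketch only: walk the compiled test classes, load each one, and add it
// to a suite if it carries the marker interface.
public class AllTests
{
  /** Marker interface: test classes implement this to be picked up. */
  public interface AutoTest {}

  public static void main( String[] args ) throws Exception
  {
    String root = args.length > 0 ? args[0] : "build/test-classes";
    TestSuite suite = new TestSuite( "all marked tests" );
    addTests( new File( root ), "", suite );
    TestRunner.run( suite );
  }

  private static void addTests( File dir, String pkg, TestSuite suite ) throws Exception
  {
    File[] files = dir.listFiles();
    if( files == null ) return;
    for( int i = 0; i < files.length; i++ )
    {
      File f = files[i];
      if( f.isDirectory() )
      {
        addTests( f, pkg + f.getName() + ".", suite );
      }
      else if( f.getName().endsWith( ".class" ) && f.getName().indexOf( '$' ) == -1 )
      {
        String name = pkg + f.getName().substring( 0, f.getName().length() - 6 );
        Class c = Class.forName( name );
        // only TestCases carrying the marker interface get added to the suite
        if( AutoTest.class.isAssignableFrom( c ) && TestCase.class.isAssignableFrom( c ) )
        {
          suite.addTestSuite( c );
        }
      }
    }
  }
}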

One side effect of this code is that it makes for a simple Ant task:


<target name="test" depends="compile-test"
description="run the junit tests">
<echo />
<java
fork="${test.fork}"
classname="${test.classname}"
taskname="${test.task.name}"
failonerror="${test.failonerror}"
classpathref="test.classpath" >
<arg value="${test.alltests}" />
<sysproperty key="install.location"
value="${install.target}"/>
</java>
</target>

Maybe I could change the marker to a *Test* pattern or an aspect, but it's so simple to create a test class template with the marker for any IDE that it doesn't seem worth it. The downside is that you end up loading classes twice, which is not too slow, but slower than loading them once. [Btw, if you spotted the 4 second run for 19 tests above, don't worry too much - the tests are testing the lifecycle logic of a timer/scanner and are littered with sleep() calls]

I have thought about hacking something like this straight into JUnit, and now that I've seen the way Erich Gamma responded to Scott Sterling's excellent JUnit classloader fix, maybe I should. Aside from any inefficiencies, having to manually manage suites and testcases is not good from a usability standpoint (yes, I know some IDEs will take care of this, that's not the point, plus I'm not much on depending on IDEs for build management).

* As for the gui or reporting output, a mockup of my ideal display is [here]

You know you want it: Wiki SVG whiteboard

Raw Blog: WikiWhiteboard with SVG

To include a WikiWhiteboard (heh, it just named itself) on a page all you have to do is go into the edit mode and put one of these in the page :

[svg]

In every other respect (apart from the bugs), it behaves like a regular Wiki.

This is very cool. I bet every wiki and blog codebase out there will follow suit.

August 14, 2003

Meta-models all the way up: and why RDF is useful anyway

I've been away for a week and it looks like there's a lot going on in the RSS/RDF world; in grand blog tradition, time to comment on the comments :)

Spike Solution

Sam Ruby has hacked together an Atom-to-RDF sheet to get a feel for the 'RDF Tax', which has provoked a number of comments.

Dare Obasanjo wanted to know what the point was:

I'm still waiting for anyone to give a good reason for Atom being RDF compatible besides buzzword compliance. Since you seem interested in making this happen can you point out the concrete benefits of doing this that don't contain the phrase "Semantic Web"?

RDF is a BingoBuzzword now? Anyway, let's try to come up with a few reasons why RDF might be worthwhile. We'll start with a standard device for any technology that is not widely adopted - draw a spurious analogy between it and Lisp.

Lisp is extremely powerful because its evaluation rules allow for meta-circular evaluation (in the history of big ideas in computing, this one is right up there). Lisp is what some call a programmable programming language. This is enabled by its 'weird' syntax, which is very likely an optimal one (although Ruby and Python give us pause). Consider that you can add just about any programming language feature/craze to Lisp without altering the evaluation rules, and consider that many of the current features/crazes we're interested in today have been available as Lisp graft-ons for what would be, in Internet years, roughly a century. Compare that to the funky way languages like Java (and in time, C#) extend and adapt - they evolve by applying patches, not grafts. These are not 100 year languages. But Java and C# have familiar syntax and active communities which will continue to develop and apply patches and band-aids, covering a multitude of sins resulting from non-uniformity.

RDF is tangentially like Lisp, only for content instead of programs. The potential power of RDF is in the uniform way it lets you describe content. So, here are three areas where I think RDF's uniform content model can help.

- Vocabulary mixins

Since it's triples all the way down, the process for vocabulary merging in RDF is much the same as that for merging two graphs, with extra constraints thrown in in order to keep the meaning of the new graph consistent with RDF's rules (as described in the RDF model theory). It's very simple in principle. In practice the tools aren't so hot and the XML is too difficult to work with directly - ultimately people want to deal with RDF in XML. Yet, it's probably easier to express and manage a dependency graph using raw makefiles or Ant hacks than express and manage simple content in RDF/XML. That's damning.

- Vocabulary extensibility

In RDF, extensibility is subtly different to the kind of extensibility that XML/RSS authors are often concerned about, which is the insertion of new names into sets of namespaces - what we might call shallow extensibility. To a large degree they are concerned about this because that is the place where XML Namespaces ushers them. The idea of vocabulary extension via mixing of vocabularies, what might be called deep extensibility, is much less considered, because you simply can't express that idea within the language game presupposed by XML Namespaces. This is why you need a uniform content model to make sense of namespaces, and why XML Namespaces is a technology for firewalling content, not extending it. Though by the time you had a uniform model, I'd be arguing that these namespaces were a poor way to serialize the vocabularies anyway :)

The reason RDF is good at extending and merging vocabularies is because there aren't any.

Let me explain. I think the area that excites people most about RDF is that extensibility is achieved by merging graphs, not vocabularies. Vocabularies for RDF are just sub-graphs - that people happen to find them useful or not is incidental. Vocabulary has the same bearing in RDF as Nationality has in genetics - which is none at all. In much the same way my being Irish is irrelevant to my genetic makeup, my having a vocabulary is irrelevant to my RDF. In a very real sense, there are no vocabularies in an RDF graph. That you see them at all is an illusion. (The only fundamental aspect of being Irish with regard to RDF is a love of the number 3 :)

To extend some RDF you add new triples to its graph. That's it. You find out what nodes on the current graph are the same as the nodes on the graph you want to import; those are your join points. You don't figure out how to scrunch two blocks of XML or two databases together. You don't transform one vocabulary to another, or, if you are a systems integrator, look to establish the least general vocabulary that will unify the two (if you are a standards body, you will naturally look to find the most general unifying vocabulary). While there are usability and social benefits in having them around, there's no need to enforce the separation of vocabularies in RDF machinery - that's very much an artefact of XML Namespaces.
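If you want to see how little machinery is involved, here's roughly what that looks like with a toolkit like Jena - a sketch, not anything from a real application, and the URIs are made up. The 'merge' is nothing more than the union of two sets of triples, with the shared URI acting as the join point:

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

// Sketch only, assuming the Jena toolkit; the URIs are invented.
public class GraphMerge
{
  public static void main( String[] args )
  {
    Model feed = ModelFactory.createDefaultModel();
    Model foaf = ModelFactory.createDefaultModel();

    Resource me = feed.createResource( "http://example.org/people#bill" );
    Property title = feed.createProperty( "http://example.org/terms#", "title" );
    feed.add( me, title, "August 2003 Archives" );

    Resource samePerson = foaf.createResource( "http://example.org/people#bill" );
    Property name = foaf.createProperty( "http://xmlns.com/foaf/0.1/", "name" );
    foaf.add( samePerson, name, "Bill" );

    // the union graph now carries both statements about the one node -
    // no vocabulary machinery involved, just triples
    Model merged = feed.union( foaf );
    merged.write( System.out, "N-TRIPLE" );
  }
}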

All this for me tips XML Namespaces' alleged usefulness for managing vocabularies into a cocked hat - they are at best a means to transport RDF graphs around using XML. They're mechanism, not policy, and we need to stop treating them as policy.

- Query

Perhaps the most interesting use for RDF is query. This shouldn't be a surprise - query is simply an application of the very inference that makes so many people suspicious of RDF as the key ingredient of a pie in the sky AI project (the formal relationships between query, inference and theorem proving have been well understood for decades).

There is a growing need to coordinate and unify content - the rate of content generation far exceeds our ability to manipulate and digest it, even with statistical techniques and on the fly transformation to known structures. Uniformity is a big idea here - XML provides uniform syntax, but we still could do with a uniform content model for the query engines. As Edgar Codd found out, it's easier to query over a uniform model. And unless one of the existing KR languages sees overnight adoption, the relational model gets a facelift, or there are fundamental breakthroughs in the math behind statistical search, RDF is the only game in town. You can of course choose to patch things together in an analog of the commercially popular languages, Heath Robinson style. But you might end up suffering death from a thousand hacks.

Jon Udell draws a comparison between RDF uniformity and XML querying:

If the RDF folks have really solved the symbol grounding problem, I'm all ears. I'll never turn down a free lunch! If the claim is, more modestly, that RDF gives us a common processing model for content -- a Content Virtual Machine -- then I will assert a counter-claim. XML is a kind of Content Virtual Machine too, and XPath, XQuery, and SQL/XML are examples of unifying processing models. [...]

This is a very good point, but I'll make a distinction to help see why RDF has something else to offer. The distinction is between inscription and description. XML in this sense is a Syntax Virtual Machine. It's a grammar for grammars. And as such it is immensely valuable, now that we've mostly agreed to use it for new data formats. RDF is a grammar for content.

To draw a flaky biological analogy, XML is the raw material you would inscribe nucleic and mitochondrial DNA strings with. RDF is the stuff you would use to describe those strings to the proteins (which by the way, read in DNA as triples ;) who make the molecules blueprinted by the DNA. To be honest, I believe that syntactic building blocks are more fundamental than content blocks. Which is possibly why I depart company with RDFers regarding RDF/XML.

(As for symbol grounding, no such luck :)

[...] As we move into the realm of extensible aggregators we'll face the same old issues of platform support and code mobility. Nothing new there. However, as XQuery and SQL/XML move into the mainstream -- as is rapidly occurring -- aggregator developers are going to find themselves in possession of new data-management tools that can combine and query structured payloads. Those tools will not, because they cannot, know a priori what those payloads mean. But they'll provide leverage, and will simplify otherwise more complex chores. I can't see the endgame, but for me this is enough to justify doing the experiment. [RDF uniformity and XML querying]

XMLQuery has a practical importance in the enterprise. In principle it will let us delay the moment we start putting XML into relational databases and therefore help mitigate creating huge system bottlenecks and dissonant system layers. My hope is that it will let us scale beyond file system inspections and ad-hoc scripts while avoiding the problem of n-tiered middleware, namely that all roads lead to a database. But neither it nor XPath is a basis for describing the domain level concepts that businesses care about. Today the best practical approach is an XML document designed by a good modeller that the application programmers can have at, but RDF backed query could be a very powerful and highly scalable augmentation to good XML document designs.

Note that this is quite distinct from a uniform ontological model or a model of domains - no-one who understands what RDF is about is looking for a theory of everything. That is what you should be suspicious of - anyone offering The One True Ontology or The One True Model is selling snake oil or needs a crash course in modern philosophy. RDF is just another way to ease the pain of integration (where integration = interoperation + costs).

RDF/XML syntax

Arve Bersvendsen:

Human readability is underestimated. If this atrocius RDF syntax is chosen, you are going to scare off relatively non-technical people from producing an Atom feed. You can argue from here 'till eternity that "It's just applying a transformation with XSLT", but remember that many (most) people even have problems understanding a concept as simple as CSS.

Unless the Atom syndication format is going to lie in the same forgotten pool of mud that W3C RSS is, I believe RDF is best forgotten

Arve is objecting to RDF/XML while naming RDF. But on the whole, RDF itself is a pretty good idea, even if it is meta-models all the way up. RDF/XML undoubtedly remains a problem. It's the main reason I went into the RDF wilderness and don't normally evangelise RDF in my work - in the trenches, syntax matters.

Danny Ayers

I don't think the 25% increase is anything to worry overmuch about - I'm sure it can be brought down significantly (I notice an rdf:Description in there for a start). Some human readability might have been lost in the process, but a lot of machine-readability has effectively been gained.

To butcher Alistair Cockburn's observation on process - machine readability can only ever be a second order effect, human effect is always first order. Unfortunately RDF/XML is a barrier to entry whatever the rationales offered by RDFers. The current excitement about RDF pales into insignificance compared to what might happen with a usable, hackable syntax - it could have been game set and match two years ago instead of the daily uphill battle of selling RDF.

Elsewhere, Danny has said that RDF/XML is not the only fruit and points to Uche Ogbuji's observation that for RDF, there is no syntax. This is only partially true. Practically there must be a syntax and consequent software codecs, or we can't communicate (if someone has a communications model without the presence of syntax, feel free to follow up - and by all means do so without recourse to syntax ;).

That RDF is syntax-independent is often touted as an unqualified benefit, but it isn't always one. It is a benefit in that we can reason about the rules for merging RDF graphs without recourse to syntax - this is no different to being able to reason about a theorem without recourse to a notation; often the notation can get in the way. Likewise it's good to be able to talk about a graph without having to worry about Java Objects or C pointers. But in practice there is always a notation, a syntax, and in software development, notation is frightfully important. It's what you will be working with day in, day out, after all.

After four years of head scratching, I genuinely believe RDFers have a double blindspot with the XML syntax. First off, they see so much value in RDF that the benefits outweigh the costs of the syntax, or any syntax. Second, saying there is no syntax seems like the ultimate get-out clause - don't like the syntax? It's ok, the syntax doesn't matter, and you can write a new one if you want. The first blind spot I have sympathy for. The second I have none, not any more. Interoperation doesn't start with models, it starts with syntax. Shared syntax is the prerequisite for interop. We have a couple of decades' experience to back that observation up. Interoperable models may indeed follow and piggyback on syntax, but looking to models first is a mistake. And if the first reaction to the syntax is to reject it, pushing the model around is going to be tough work.

August 08, 2003

The road to Damascus: Jon Udell and RDF

Jon Udell: An RSS/RDF epiphany.

Arguably I should get a life :-), but for me this remark was an epiphany. I've long suspected that we won't really understand what it means to mix XML namespaces until we do some large-scale experimentation. What I hadn't fully appreciated, until just now, is the deep connection between RDF and namespace-mixing. Dan's original hard-line position, he now explains, was that there is no sane way to mix namespaces without some higher-order model, and that RDF is that model. That he is now modulating that position, and saying that none of us yet knows whether or not that is true, strikes me as both intellectually honest and potentially a logjam-breaker.

One of the use cases for XML namespaces was in fact RDF, which had a (perhaps at the time the) strong need to mix vocabularies. Especially, it needed a way to embed URIs in a syntax that would not allow them. The solution was the QName. As it turned out, XML Namespaces are less than ideal for transporting URIs and they are certainly not a sufficient mechanism to deliver RDF - you still need to finesse the XML into the infamous striped syntax, which often seems like a pointless burden.

But, to their credit, what the RDF people have always understood, and what the XML people of all walks who look to XML Namespaces to this day still do not appreciate, is that you can't mix and match with XML Namespaces without an underlying data model to unify the vocabularies. RDF has a model - it's the triple relation.

We have to understand that XML namespaces is not a technology for integrating and extending vocabularies - it's a technology for firewalling them, keeping them apart. Yet, without the shared model, there's not much point in doing this. Indeed, it's a fool's errand - for as soon as you've defined your set of names, job one is to turn around and integrate them with someone else's set of names. You may well think you need to be able to add new names without them banging into someone else's names, but what you really want to do is compose names to create new compound vocabularies. Integration is where the value is.

In response to Danny's article, Jon says:

Shouldn't we then substitute XML for RSS 2.0 in that sentence, and say there is no consistent way to interpret material from other namespaces in any XML document, period?
Shouldn't we then say, there is no reason to create any mixed-namespace XML document that is not RDF?

Yes we should say that, but that would be saying the Emperor has no clothes. Does anyone want to hear it?

And Jon is right, this does go beyond RSS. Web services for the most part are predicated on XML namespaced vocabularies, as are any number of behind-the-firewall integration efforts. In those worlds, there's historically been zero agreement on uniform content models, which is precisely why transformation is such an effective technology for integrating systems. Get the data into XML and start pipelining. And though neither the declarative nor the API/RPC school of integration may like the idea of chaining processes with XML, in my and my employer's experience, the results speak for themselves. In truth, XML Namespaces are incidental to a transformation architecture.

Finally, a plea to all concerned. Let's stop punishing RSS syndication for its success by asking it to carry the whole burden of XML usage in the semantic Web.

I don't think anyone's punishing RSS. But the RSS community had a shot to rewrite the web with RSS1.0 and blew it by confusing the simple with the simplistic. The RDF community too had their shot to rewrite the web with RSS1.0 and we blew it, by digging in with an XML syntax that nobody wanted. There's still bad blood from the RSS1.0 wars, but I firmly believe neither statement is controversial anymore, nor is this one - everybody lost.

At the end of the day all proposals to extend RSS via XML namespaces reduce RSS to SOAP - a carrier format for another content model you may or may not have the codec for. The problem with this is not immediately obvious since all we've really done with RSS is render and clickthrough, but that's changing now. The missing piece is the shared content model, which many people believe should be RDF.

By the way, if you don't like all the semantic web stuff that RDF is associated with, here's another way of looking at it. Think of RDF as a CVM, a Content Virtual Machine, out of which any content can be described and by which content codecs can interoperate, by sharing a uniform view of the data. That's all there really is to RDF - an instruction set for content description. This is no more naive a view than Java's WORA.

In a surreal take on Greenspun's tenth rule, Danny Ayers, among others, has said that RSS would end up reinventing RDF, but altogether ad-hoc and badly conceived. That can still be avoided.

Weblog comments: just say broken.

A while back I blocked comments from this blog. They're back on now (due to popular demand ;)

Why no comments?

The idea was to get people to use track/ping back instead, which is such a cool, albeit latent, technology.

It's so much more interesting to boldly go somewhere else to read someone else's bold thoughts, than read them here. You can poke about their site for other interesting thoughts, whereas there's not much here I haven't thought of already. Much more potential for serendipity, lo!, and all that interconnected goodstuff.

I'd say trackback has the potential to turn the web inside out in short order. It lets you browse the web backways, from origin to referrer, which is altogether a fun thing. Only Forward is getting tired after a decade. And if the links are enriched, you have the potential to take centralized search and consign it to the sidelines, or more accurately, distribute and parallelize the processing of inbound links to pages across the entire web instead of a privately held super cluster.

Speculation aside, there is a real usability problem with weblog comments in that I'm damned if I can remember where I comment. Comments in weblogs are feature fallout from that other medium - well known, well branded web fora where lots of people come together to chew the fat. Most of us are not likely to forget where the TSS or /. is, even if we want to. However, I do forget where I left comments on less well known, unbranded, but equally interesting (if not more so) ornery weblogs. And worse, I know I'll forget, so I don't bother most of the time. If I really want to chew the fat, I do it here, because I won't (probably) forget where my own blog is. Sometimes I add a comment saying 'I said stuff over here' (here). But I hate doing that because there's an etiquette vibe thing - nothing definite, just a feeling that many bloggers are sensitive about comment stickiness.

Otherwise I'm in the hands of lady luck and the aggregators. This is why I often prefix a post with the author's name - not trying to wine and dine, mind you, just broadcasting in the hope they'll see it and read it. I've even considered hacking FOAF or blog URLs into categories. I am without a doubt a desperate man.

So if comments are not going away soon, damn you all, there's a bunch of work, thinking and opportunity to make them usable. For example, it couldn't be that hard for a client to keep a store of which posts you commented on; that way you could go back to see if anyone flamed you. Maybe some client already does this, but anyone who says hacking web/feed clients is a Dead Zone is nuts. They're nowhere near good enough yet :)

Why is Maven a top level project?

I saw Maven described as being part of the "usual infrastructure" on the Geronimo site. I've investigated Maven as a dependency manager for work and it's still a ways from what I'd call infrastructure. That may change, Maven is highly active.

I'm not saying it shouldn't be a top level project, just curious why it is, or perhaps why it's not in Jakarta, given that it's far from baked in my humble opinion, depends on non-top level stuff like Jelly, and about all you can do with it is manage Java project dependencies. What's the ASF community's rationale for this?

August 06, 2003

The Geronimo proposal

[PROPOSAL] An Apache J2EE server project

(Thanks to James for providing the archive link)

Tim Bray: DHTML v Flash

ongoing: Unflash

Insightful analysis on one company's migration from DHTML to Flash and back again.

August 05, 2003

XP myth dispelled: no code ownership

Who's to blame

In Extreme Programming (XP) there is a notion of no code ownership.

Utterly, utterly wrong. In Extreme Programming there is collective code ownership. An XP developer owns all the code. I was going to call this out as FUD, but I actually think it's basic lack of knowledge about XP/Agile.

In his weblog entry @author:Bob the Builder, Meeraj Kunnumpurath writes that in his experience this [no code ownership] doesn't work.

Perhaps that's why XP does not advocate it. But as an aside about Meeraj's post, I'll observe that,

  • if you have to whip your devs into shape with a hall of shame, you have the wrong devs to begin with, or at best the devs are thinking about things the wrong way and need educating. Hiring one cowboy is forgivable, but hiring a whole team of them strikes me as careless.

  • @author tags rapidly become meaningless over time as people work on different parts of the code. Certainly some people will concentrate on specific areas of the code, but the idea that one person only ever authors some unit of functionality (class, package, jar, whatever) and is therefore in toto responsible is a reality distortion.

  • I understand where Meeraj is trying to get to with a process, but what you really want are people who will take personal responsibility for improving systems. Claiming bits of code, or forcing people to claim bits of code, isn't by itself adequate. Put another way, it's playing not to lose rather than playing to win.

Lorem ipsum dolor sit amet...

Lorem Ipsum - All the facts - Lipsum generator

Wonderful tool, reminds me of Art College.

Excellent piece by Kevlin Henney

The Taxation of Representation

One of the best things I've read on data modelling (or more accurately, implementing data models) in a while.

There's normally plenty of discussion surrounding data models, but just not enough on the implementation costs of a model.

If you liked it, chances are you'll love Bill Kent's writing. And for those curious about calendars, Ed Reingold's Calendrical Calculations is a fascinating book (with code in Java and Lisp).

  

August 04, 2003

RDF Name Service

Bill Kearney: Documents vs triple stores?

Following the TAG list lately, I'm starting to think that RDF needs a DNS. Bill nails the use case.

Where this becomes an issue for me is how to tell that statements made in a Foaf file are authoritative. As in, how can I state that this is MY own Foaf file and that statements I choose to make about resources I control are to be considered authoritative. This, to many people, is a very obvious need. When you start smushing data together it gives rise to possible problems where the statements I make about something under my direct influence are contradicted by others. I'm not expressing this as a control-freak issue but as a genuine concern.

If you look at how DNS works, you'll see it has the notion of authority built in:


dehora:~ 504$ nslookup www.ideaspace.net
Server: ns1.tinet.ie
Address: 159.134.237.6

Non-authoritative answer:
Name: homepage01.svc01.clickvision.com
Address: 208.171.83.4
Aliases: www.ideaspace.net


That's telling me the query did not go back to the authority for Bill's zone, so caveat lector. This allows DNS itself to scale, as well as controlling the amount of DNS traffic that goes over the Internet, which can be substantial. It also improves responsiveness. At work we recently installed a caching DNS server behind the firewall at one of our offices. The difference it makes to browsing is evident. Now consider that triples queries will be machine generated and so will very likely happen at rates orders of magnitude beyond what humans can force today - a web where everyone has their own spider (Metcalfesque predictions of web meltdown are no doubt imminent).

Something like this would be desirable for triple lookups, or any web-centric fact base.

When everything's dumped into a repository there's going to be a little trouble 'being sure' that you've got correct stuff dumped into it. The web services folks are perhaps scratching their heads here, thinking 'well, duh, just go resolve the URL and parse it' and they're correct.

Well no, they're wrong if they say that. Sometimes you don't get to go back to the authority. Deep down, the web is a mish-mash of caches, mirrors and geographically located content delivery networks, all doing their bit to make it scale. Often, the physical activity involved in routing requests and responses between machines has little bearing on the logical structure of the web - the clients and what web-heads call origin servers. Arguably web caching isn't sufficient for the semantic web and is an undesirable munging of layers (the output of your reasoner suddenly depends on a Pragma header).

For the most part, always going back to the origin doesn't scale, inferential fidelity be damned. To make that work we'd need a P2P, not client-server, architecture - or, much more likely, we need to build the semantic web loading tools with caching and staleness in mind, if not the reasoners themselves. Technologies like client-side caches and offline access need to improve, greatly. For the reasoners proper, Truth Maintenance Systems are the first port of call.

YesWiki

Burningbird: NotWiki

Wikis favor the aggressive, the obsessive, and the compulsive: aggressive to edit or delete others work; obsessive to keep up with the changes; and compulsive to keep pick, pick, picking at the pages, until there's dozens of dinky little edits everyday, and thousands of dinky little offshoot pages.

Such attention to detail often makes for great content. On the other hand, the state of the atom wiki shows perhaps, that there hasn't been enough of this. We need to find our inner editor :)

What we need now is a hold moment. We need to put this effort into Pause, and to look around at the devastation and figure what to keep and what to move aside; and to document the effort, and its history, for the folks who have pulled away from the Wiki because of the atmosphere.

Sounds like a code freeze, which is a process smell. But in this case I agree - atom wiki is in real need of a spring cleaning.

The atmosphere is that which the contributors brought with them - smart, bitchy, aggressive, polarized, individualistic. The wiki if anything dissipates the negative connotations of that energy. The value in being moronic plummets when you realize anyone can remove your words. For your words to remain, they must be of value. You are in a sense coerced into making a positive contribution. On a wiki, negativity doesn't scale.

But Wikis also favor enormous amounts of collaboration among a pretty disparate crew, which is why there's also all sorts of feeds being tested, and APIs being explored, and a data model that everyone feels pretty darn good about. So one can also say that Wikis favor the motivated, the dedicated, and the determined.

DoublePlusGood :)

We need to record what's been accomplished in a non-perishable (i.e. not editable), human manner. No Internet standard specification format. Words. Real ones. We then need to give people a chance to comment on this work, but not in the Wiki. Or not only in the wiki. Document the material in one spot -- a weblog. After all, this is about weblogging -- doesn't it make sense that we start moving this into the weblogging world again? Not bunches of weblogs, with bits and pieces.

I think if it was only about weblogs, none of this would matter very much. But I suspect this is about the near future of both content syndication and content notification. And that's any content. RSS is going to be used in a lot of unexpected places in the next eighteen months - for example, it's a great fit with web services asynchronous messaging, which depending on your point of view is either a land of opportunity or the latest revenue nightmare. Owning and influencing direction is everything.

Community backlash, wikis are fine

Sam Ruby: Wiki backlash?

No. Wrong mindset approach backlash.

A wiki is a collection of living documents that make a space. It's not a version control system, it's not an archive, it's not a cms, it's not fora, it's not a mailing list, it's not the web. Above all, it's not a quick fix for a community's problems - look to yourself before your tools. About the best the wiki will do is expose a community's lack of civility in short order.

People like Danny Ayers, Joe Gregorio, and myself have tried to pull things off wiki in more digestible chunks onto our weblogs.

Antipattern alert 1: Someone else will clean up.

If you go to the c2 wiki, you'll find you don't need a human to filter the content through a blog. That's because people have been keeping the place tidy. It's an internally coherent place and easy to navigate, because people made the effort to make it so.

By contrast, the atom wiki is a mess. Yet, if people had invested a fraction of the time they spent talking about it in cleaning it up, it would be ok over there.

Antipattern alert 2: RefactorOK.

I can't state this strongly enough, so I'll put it in bold red: RefactorIsAlwaysOk. Wikis do not work if they are not kept tidy; anyone who spent time on C2 could have told you that. The discipline is to learn that you must weed out and delete unnecessary content - especially, but not just, your own. If you don't or won't get that, the wiki ceases to function well as a place. This is no different to keeping your code or your kitchen or your garden under control.

There's a letting go that has to happen to use them properly. The aim of your editing a wiki page is not to express your opinion, or to make a point, or be renowned for either; it's to make that page the best it can be, right now. This is the essence of the DocumentUnderDevelopment. If the document is as good as it can be, you don't need a history (which is simply a rationale or an excuse for why it is not as good as it should be), and you don't need versions (if it's the best it can be, why hold onto an inferior copy?).

I submit you can't do this with the atom wiki under the current mindset, which is highly individualized (that's blog culture) and mildly paranoid and fearful (that's RSS culture).

Despite this - we need to do better. I continue to explore more ways to make this project accessible to everybody who wishes to participate. Mailing lists. IRC. Face to face. Got a suggestion? Let me know!

Yes, I have a few suggestions ('you' is plural).

Drop RefactorOK as your personal approach to working with the wiki. That concept alone is doing more damage to the atom wiki and atom's progress than anything else - it just doesn't work. Link from the wiki to your blog if something must be frozen.

The second is to spend some of your time on the wiki just cleaning up what's already there to make it better, instead of piling on more quasi-organised content. Just five or ten minutes a day will make a huge difference - pick a page and make it better. Stay on top of the little things. For example, if you see a spelling or grammatical mistake, fix it. Especially, try to move pages out of thread mode into consolidated content.

Spend some time on C2, learning why it works. That wiki has been through all this before. If you conclude that you don't like the way it works, or can't get comfortable with it, maybe a wiki's not the tool for you.

Use backlinks to organise and classify content. What makes a wiki truly different to the web is that it has backlinks. We're just getting a taste of this now with trackback, but Wikis have always had it. Faceted metadata is virtually free there.

[The Echo wiki]
[Just use a Wiki]
[Free classification using a Wiki]

August 03, 2003

Free classification using a Wiki

Like a Wiki Diamond

Faceted classification is a technique that lets you categorize "things" into multiple overlapping hierarchies. For example a Madonna CD might be categorized under "Artists / Madonna" as well as "Format / Compact Disc" and "Price / $10 - $15". The advantage of this is that it doesn't force a dominant decomposition upon the user (to borrow a phrase from Aspect Oriented Programming). For example, one user might restrict themselves to cheap CD's first, while another user might want everything Madonna, cost be damned.

Hmm. Wikis can do most of this already through backlinks. All you have to do is:

  • Make a WikiWord for your class/facet.
  • Put it in the pages you want faceted
  • Click on the WikiWord page's title to find all pages in the facet.

That covers about 80% of everything I need from faceted classification. Further metadata wonkage can be achieved by embedding RDF in the facet page itself, using the WikiWord URIs to denote something or by running a spider over the wiki and stuffing the results into a TopicMap.

The other 20% could be had by allowing a Wiki query to find the union/difference of two wikiwords, something I noodled about a long time ago, but amn't quite sure how to implement.
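That said, the query half is probably no more than set operations over each WikiWord's backlink page set. A sketch of the idea - there's no real wiki engine API here, just sets of page names:

import java.util.HashSet;
import java.util.Set;

// Sketch only: a faceted query is a set operation over the backlink
// page sets of two WikiWords.
public class FacetQuery
{
  /** Pages carrying either facet: the union of the two backlink sets. */
  public static Set eitherFacet( Set facetA, Set facetB )
  {
    Set result = new HashSet( facetA );
    result.addAll( facetB );
    return result;
  }

  /** Pages carrying the first facet but not the second: the difference. */
  public static Set firstFacetOnly( Set facetA, Set facetB )
  {
    Set result = new HashSet( facetA );
    result.removeAll( facetB );
    return result;
  }

  // intersection (retainAll) gives "pages in both facets", which is the
  // usual drill-down in a faceted browse
}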

If you've got a bliki or SnipSnap you should try it and see how much you can get done for essentially no cost. If you don't have a wiki handy, you can play with this one.

XML. Hard for who?

ongoing · XML Is Too Hard For Programmers

Sorry, no.

Vtable manipulations are hard. Pointer math is hard. Threads are hard. Garbage collection is hard. Caches are hard. Customers are hard.

XML is not hard. XML is a gift. Move along.

August 01, 2003

Coathangers

e-Government @large

The team started a year or so ago with a total content base of 640,000 pages of content. After one full year of work they had deleted 680,000 pages (not a typo) because of duplication, redundancy or whatever .... and they still had 550,000! How's that work you wonder? A growth rate of close to 100% a year is how. A few months later they now have 800,000 pages.

Relentless

Towards Jython 2.2

Jython, lest you do not know of it, is the most compelling weapon the Java platform has for its survival into the 21st century:-)

A (non-technical, but important) argument against Jython is that it has looked moribund for some time now. I guess this release knocks that on the head.

The other (non-technical, but important) argument is that 'you can't sell scripting languages into the enterprise'. Maybe so, but you sure can sell results into the enterprise. Let's face it, even Sun recognise now that Java is not a good fit for scripting, and that scripting is a valuable activity in enterprise work, hence the will to support new languages running on the JVM.

In Java, the only other viable option to Jython seems to be using XML as your scripting language. But is programming in XML really in your and your customers' best interests? Read Jonathan Simon's article, which is turning heads, and judge for yourself.

Ah, that's your name then

Incipient(thoughts)

Sometimes, I wish people would put their names on their blog. I've been to this blog three times now, and only this morning did I click that it's Laurent Bossavit's. Hi Laurent :)