" /> Bill de hÓra: February 2004 Archives


February 29, 2004

trackback fixed

Thanks to Bob Wyman for pointing out my trackback was busted. Normal transmission resumed.

Atom: the three hardest problems

You'd think speccing a feed format would be straightforward, but the way things are going on the atom-syntax list over the last few days, Atom will have to make a best effort to address the versioning and naming problems before it can proceed.

Now, where's the cache-invalidation thread at?

February 28, 2004

XML options in Java and .NET

As for whether the Sun's approach of just providing interfaces instead of concrete for XML parsing was such a great thing in Java I'd claim that it's been hit and miss. - Dare Obasanjo, on XML factory loading

I think we'd agree in Java-land that cross-platform APIs have been a mistake (except perhaps for SAX). As for the whole factory and dynamic loading model for raw parsers, well that can get extremely messy in Java. Most of us have run into some form of Xerces hell at some point. To be fair that is usually a problem induced by the Java classloading architecture (what architecture?) rather than XML APIs. I suspect that the .NET loading model isn't much better, but .NET has the luxury of having fewer things to load as Dare points out:
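For anyone who hasn't lived through it, this is the lookup model in question - a minimal JAXP sketch, where the Xerces class name is just one example of pinning an implementation and feed.xml is a placeholder:

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import java.io.File;

    public class FactoryLoadingExample {
        public static void main(String[] args) throws Exception {
            // Pin the implementation explicitly rather than relying on the
            // classpath lookup - one way to sidestep "Xerces hell". The class
            // name here is Xerces' JAXP factory; swap in whatever parser you ship.
            System.setProperty("javax.xml.parsers.DocumentBuilderFactory",
                    "org.apache.xerces.jaxp.DocumentBuilderFactoryImpl");

            // newInstance() walks the JAXP lookup chain: the system property,
            // $JAVA_HOME/lib/jaxp.properties, META-INF/services, then the default.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);

            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new File("feed.xml"));
            System.out.println("Parsed with: " + builder.getClass().getName());
            System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
        }
    }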

The funny thing is that even if we shipped functionality where we looked in the registry or in some config file before figuring out what XML parser to load it's not as if there are an abundance of third party XML parsers targetting the .NET Framework in the first place.

Really, who's going to use something other than System.Xml/MSXML?

XML support for Java, however, is fantastic once you muddle through the options. To name just a few, SAX2, XOM, JDOM, XmlPull, XmlBeans and Jaxen are all really very good libraries (and open source). To be fair to Sun, the JAX* set of APIs had to evolve piecemeal and so is not always consistent, coherent or free of mistakes - a case of putting the wheels on a moving car. The .NET APIs have had the luxury of coming a bit later.

All in all, I see the use of interfaces or not as a red herring here. It comes down to what value cross-platform APIs have (if any), how dynamic implementation loading is managed in a static context, and whether you actually need multiple implementations in the first place.

Under the hood at PubSub

I got into an email conversation with Bob Wyman a while back about the PubSub feed aggregator. With his permission I'm blogging about the PubSub architecture and internal processing model.

Bob asked that I not paint a negative picture of him as anti-XML, and I hope I haven't done that - PubSub doesn't strike me as anything other than a great service. For those of you that aren't XML obsessives, Bob has taken some heat in the XML community over the last year for promoting binary infoset approaches. So when I asked if he was using binfosets, he responded:

It's not binfoset exactly. What we've got is a set of machines that talk to the outside world and convert XML to and from the ASN.1 PER binary encodings that we use internally. (We use OSS Nokalva's ASN.1 tools.) The result is great compression as well as extremely fast parsing. In an application like ours, we have to do everything we can to optimize throughput, and while XML is really easy for people to generate, it just takes too much resource to push around and parse. Currently, we're monitoring over 1 million blogs. Since we're still pretty new, we've still got fewer than 10,000 subscriptions, so there is no real load on the system. We're usually matching at a rate of about 2.5 to 3 billion matches per day and the CPU on our matching engine is basically idling (i.e. 3-5% most of the time). This is, of course, in part due to the work we put into optimizing the real-time matching algorithm (we need to match *every* subscription against every new blog entry). However, it is also in part because the matching engine never needs to do the string parsing that XML would require.

It's worth noting that all this is internal to PubSub; the public server I/O is XML.
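Purely for illustration, here's a crude Java sketch of the boundary pattern Bob describes: parse the XML once at the edge, convert it to a compact internal record, and let the internal engines pass that around instead. The Entry record and its hand-rolled encoding below are hypothetical stand-ins for PubSub's ASN.1 PER PDUs, not their actual format or tooling.

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import java.io.*;

    // Hypothetical internal record standing in for an ASN.1 PER-encoded PDU.
    class Entry {
        final String link;
        final String title;
        final String content;
        Entry(String link, String title, String content) {
            this.link = link; this.title = title; this.content = content;
        }
        // Compact, length-prefixed binary form passed between internal "threads"/boxes.
        byte[] encode() throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(link);
            out.writeUTF(title);
            out.writeUTF(content);
            return bytes.toByteArray();
        }
        static Entry decode(byte[] pdu) throws IOException {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(pdu));
            return new Entry(in.readUTF(), in.readUTF(), in.readUTF());
        }
    }

    public class EdgeConverter {
        // XML is parsed once, here at the system boundary; everything downstream
        // (the matching engine) sees only the binary PDU.
        public static byte[] toPdu(InputStream xml) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(xml);
            return new Entry(text(doc, "link"),
                             text(doc, "title"),
                             text(doc, "description")).encode();
        }

        private static String text(Document doc, String element) {
            NodeList nodes = doc.getElementsByTagName(element);
            return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : "";
        }
    }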

On XML v Binfosets and the processing model:

My comments should not be read as "anti-XML". I'm simply pointing out a method of working with XML in a high volume environment. Just as people will often convert XML to DOM trees or SAX event streams when processing within a single process or box, what we do is convert to ASN.1 PER when processing within our "box." The fact that our "box" is made up of multiple boxes is, architecturally, no different from what would be the case if we had one thread parsing XML and another working with the DOM or binfoset that resulted from the parse. Our "threads" are running on different machines connected via an internal high-speed network and we pass data between the "threads" as ASN.1 PER-encoded PDUs -- not DOM trees or SAX events.

On PubSub metrics:

As it turns out, the problem of monitoring blog traffic is much easier than it might look. Imagine, if you will, that every one of 1 million blogs was updated twice a day -- giving 2 million updates (much more than what really happens). That is still only an average of 23 updates per second. 23 updates per second isn't a tremendous amount of traffic to handle. It is likely that even an "all XML" service could handle such load although such a system would have much less "headroom" than our system does and would need to scale to multiple matching engines sooner than we will. But, hardware is cheap... For most people, buying more hardware will be more cost effective than going through all the complexity and algorithm tuning that we've had to do. We spend a great deal of time working on the throughput since we expect to be getting much higher volumes of traffic from non-blog sources in the future.

The hardware statement is interesting; it seems to align with the Google view of using commodity boxes while keeping the smarts in software.

On scalability:

There are certainly many examples of XML based systems that handle reasonable amounts of traffic with no problem. Thus, it is likely that there aren't going to be many applications that require the kind of optimization effort that we're forced to make. Nonetheless, it should be recognized that there comes a point where it becomes wise to do something other than process XML directly at all points in a system.

On the value of XML:

I'd also like to make sure you know that there is no question about my appreciation of the strengths of XML. There is no question that if we required all our inputs to be in anything other than XML, we would have virtually no input to work with. XML is so easy for people to generate that the net is literally overflowing with the stuff and there is still much more to come. It may be malformed, filled with namespaced additions (which are often no more than noise...), etc., but we can still manage to make sense of most of what we receive. Things would be cleaner if all data came to us in more strictly defined formats, but it is better to get messy data than no data.

On future interfaces into PubSub:

We will, in fact, be asking some high volume publishers to send us their data using ASN.1 encodings. However, the encodings we ask for will be directly mappable to XML schemas and XML will always be considered a completely interchangeable encoding format. In this we try to stay encoding-neutral. Also, we are already seeing that more compact encodings may be appropriate when delivering data to devices that are on the end of low-bandwidth connections or that have resource requirements that demand ease of parsing. Also, we'll be sending ASN.1 encoded stuff to and from clients that we write ourselves (while allowing XML to be used if one of those clients talks to someone else's XML based server). Thus, anyone who wants to view our system as "XML only" will be able to do so and anyone who wishes to treat it like an ASN.1 based system will also be able to do so. We will be, as I said before, encoding-neutral.

The main thing I take from Bob's explanations is that PubSub, along with being a fine service, is doing a good job of separating interoperability issues from performance ones by sticking to XML at the system/web boundary and leveraging ASN.1 PER internally. That helps reduce the XML-Binfoset controversy to a kerfuffle. PubSub is not the only one working along these lines - Antarctica (Tim Bray is on the board) also consumes and produces XML, but internally converts the markup to structures optimized for the algorithms required for generating visual maps. Similarity Systems' Athanor lets you describe data matching plans in XML, but again converts to optimized data structures when it comes to making matches. The key mistake in interop terms seems to be wanting to distort XML to fit the binary/API worldview, or to replace it wholesale at the system edges.

RDF pixie dust

Brett gets the boot in:

I'm curious how RDF honestly helps in search. Watching RSS, most people generate crap feeds. Honestly. Expecting people to magically generate good RDF descriptions of their sites is almost laughable. And the obvious gambit of writing some ai pixy dust to automatically generate RDF from someone's ramblings is enough to keep me chuckling for most of the afternoon.

Honestly, yes, I am looking to build tools that will generate the RDF (indexes and metadata). I want to scrape RDF metadata from structured data, analogous to the way spiders today scrape indices from unstructured data. It's much the same issue, but I figure the signal to noise ratio will be better in the former - at least I don't see how it could be worse. It already looks like one of the first things I'll have to do is recast HTTP server logs and syslog as RDF triples.
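To make the log recasting concrete, here's a rough sketch that turns a common log format line into N-Triples. The subject URIs and the httplog vocabulary are invented for illustration; there's no agreed vocabulary for this.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LogToTriples {
        // Apache common log format: host ident user [date] "request" status bytes
        private static final Pattern CLF = Pattern.compile(
                "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) (\\S+)");

        // Hypothetical vocabulary - made up for this sketch.
        private static final String NS = "http://example.org/httplog#";

        public static String toNTriples(String logLine, String requestId) {
            Matcher m = CLF.matcher(logLine);
            if (!m.find()) return "";
            String subject = "<http://example.org/requests/" + requestId + ">";
            StringBuilder triples = new StringBuilder();
            triples.append(subject).append(" <").append(NS).append("remoteHost> \"")
                   .append(m.group(1)).append("\" .\n");
            triples.append(subject).append(" <").append(NS).append("date> \"")
                   .append(m.group(2)).append("\" .\n");
            triples.append(subject).append(" <").append(NS).append("method> \"")
                   .append(m.group(3)).append("\" .\n");
            triples.append(subject).append(" <").append(NS).append("path> \"")
                   .append(m.group(4)).append("\" .\n");
            triples.append(subject).append(" <").append(NS).append("status> \"")
                   .append(m.group(5)).append("\" .\n");
            return triples.toString();
        }

        public static void main(String[] args) {
            String line = "127.0.0.1 - - [01/Feb/2004:10:00:00 +0000] " +
                          "\"GET /index.html HTTP/1.1\" 200 2326";
            System.out.print(toNTriples(line, "1"));
        }
    }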

Part of this project is about exercising RDF in a domain I understand. After it, I expect to know whether RDF has value outside the academic and standards worlds, and what that value is. I was a huge, huge fan of the technology, even serving on the working group for the best part of a year, before becoming deeply disenchanted with where that process and the community at large were going (models, models, models), to the point where I felt I had little to contribute other than ranting from the sidelines. For the record, I'm still a fan, on my third reading of Shelly's book, am waiting for Danny's, and despite my opinions on the process, still have enormous respect for the work RDFCore has done. But I take a strong view that RDF metadata should layer on top of statistical and automated magma, not manual data entry; that is pixie dust. This heterogeneity is what we know works in robotics, reinforcement learning* and hybrid AI, or for any technique that has to live outside a closed environment. So I see much less need for the tidy substrate and attention to good modelling that the current RDF model-think presupposes. I also think the semweb cake is missing, or willfully ignoring, a key layer that the search engines are thriving in - the environmental noise of the web.

It's not metacrap, it's meta living on crap.

As for the AI pixie dust, I don't see computing RDF from structured data being any more pixieish than computing pagerank from a page or computing a spam filter from spam (did I say I like hybrid techniques? :). The truth is, I'm at least as skeptical as Brett, but it's like being skeptical about what a computer can do in light of the halting problem - yes, there's a hard limit, but you can still do something useful before you get there.

* and will be needed for IBM's autonomic computing feedback loops, but I digress...

[roni size: heroes]

David Parnas at the University of Limerick

How did I miss this:

Professor David Parnas, one of the world leaders in software engineering, has formally joined the University thanks to a major grant funded by Science Foundation Ireland (SFI).

Awesome. David Parnas is so much better than the fool of a Took who taught me down there (and who killed my interest in programming for many years afterwards).

[opm: el capitan]

February 27, 2004

First thoughts on a search project

Web search blows goats. Local search totally blows goats.

For the web case: we need to decentralize search by passing queries around from site to site (trackback chains, mod-pubsub, or hack the bejeesus out of mod_backhand) and allowing sites to generate metadata locally and publish it, instead of having spiders reverse engineer it from HTML (duh). No matter how fast you can do it, downloading the Web into a cluster and indexing it - in what possible world is that a good idea?

For the local case: same thing, except we do the indexing and monitoring by hanging listeners onto the OS. The plumbing and UI are different, but the index material, metadata and plugin models for listeners and indexers should be much the same. We could do lan-wide index sharing over zeroconf, that would be fun, as would a tuplespaces model instead of using MQs or interrupts. We can of course upload indices to the web or onto your phone.

Let's use RDF for the data. Having seen that people figure using SOAP envelopes is not insane for UDP discovery broadcasts, content management or systems integration, I figure RDF is as production-worthy a technology as any for search and query. Or possibly an RDF that uses WikiNames instead of URIs.

But basically: a) my continuous build thingy is going to be done in the next two months; b) I can't think of a fun mobile devices project; c) wiki, my favourite web technology, is now owned by Confluence and SnipSnap; d) I badly need better search over all my stuff.

So I'm going to give this 12-18 months. Cool names solicited.

[air: alpha beta gaga]

FridgeCracker

Someday, your neighbours' brats will try to crack your fridge, run denial of service attacks on the washing machine and own your toaster, perhaps defacing your toast in the process.

Ain't life grand.

February 26, 2004

Chris Ferris explains the tech business

Interoperability is an unnatural act for a vendor. If they (the customer) want/need interoperability, they need to demand it. They simply cannot assume that the vendors will deliver interoperable solutions out of some altruistic motivation. The vendors are primarily motivated by profit, not good will. However, most vendors do listen to their customers. - Chris Ferris

February 24, 2004

Subversion One-oh

>applause<. Finally, it's time to get off CVS.

Reinventing RDF

I'm wondering how long it will be before everybody's completely reinvented RDF in the search for what it had all along. - Edd Dumbill

As long as RDF/XML is the sanctioned syntax.

February 22, 2004

Anyone got IDEA 4 eap working with Ant 1.6?

From the alt-tab-up-enter-alt-tab department.

I've been using the IDEA 4 EAP this weekend. So far it's better than 3.0x - the interface is cleaner, with a nippier response. I like modules (I think). But it seems IDEA doesn't support Ant 1.6 (specifically import). I've been fooling around with jarfiles for the last hour - this is like being back in 2001. Anyone got it working? I'm loath to go back to entity inclusions, plus I want to move some projects onto 1.6 at work.
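For context, the Ant 1.6 feature in question looks like this (common-targets.xml is a placeholder name):

    <?xml version="1.0"?>
    <!-- Ant 1.6: <import> pulls targets and properties in from a shared build
         file, replacing the old DOCTYPE entity-inclusion hack. -->
    <project name="myproject" default="build">
      <import file="common-targets.xml"/>

      <!-- "compile" is assumed to live in common-targets.xml -->
      <target name="build" depends="compile"/>
    </project>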

[note: I bounced the date forward on this entry. It seems javablogs is picking up my feed again, so perhaps someone out there has a hack for this]

Why can't IDEs use Ant for their classpaths?

Eugene Kuleshov asks, Why can't IDEs use Ant as their project files?

Somewhat less ambitious: why can't IDEs (and Java itself) use the Ant classpath declaration structure?

[See also: java -cp classpath.xml]
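The Ant structure I mean is the path declaration - something like the following is what a hypothetical java -cp classpath.xml would consume (the file and directory names are placeholders):

    <?xml version="1.0"?>
    <project name="example" default="compile">
      <!-- Declared once, reusable by id, and glob-friendly over lib/,
           instead of a flat -cp string repeated everywhere. -->
      <path id="project.classpath">
        <pathelement location="build/classes"/>
        <fileset dir="lib">
          <include name="**/*.jar"/>
        </fileset>
      </path>

      <target name="compile">
        <javac srcdir="src" destdir="build/classes" classpathref="project.classpath"/>
      </target>
    </project>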

SemanticIntegration 101: something even an AI grad would know

AI is often said to be largely useless, but if you had done enough of it you would already know this:

In conclusion, it is clear that Semantic Web can be used to map between XML vocabularies however in non-trivial situations the extra work that must be layered on top of such approaches tends to favor using XML-centric techniques such as XSLT to map between the vocabularies instead. Dare Obasanjo

Among other things, you would also know that an important lesson folks picked up after the AI winter (whose 15-year anniversary cannot be far away) is that how you model the inert data is key; that's one reason why all the SUO and WebOnt folks are so hung up on getting the ontologies just so, and why decent tools and syntax remain a wasteland (they just don't matter as much as abstract data models in the scheme of things). So I guess AI ain't so bad after all; if nothing else it'll keep you out of the weeds.

As for mapping the complicated stuff: we've been doing that for years at Propylon. Our CTO, Sean McGrath, can wax lyrical on this. It's called pipelining, and it's the way to go for systems integration in general, not just munging a date format (perl will do just fine there). The main advantage of pipelines is the ability to keep recomposing as requirements change. In short - you can keep changing the transformation as fast as the business changes its mind. Try doing that with an XSLT write-only trainwreck.
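A minimal sketch of the idiom (not Propylon's code): a pipeline is just an ordered list of stages, each taking a document and handing back a transformed one, so recomposing it means editing the list, not the code inside each stage.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.w3c.dom.Document;

    // One step in the pipeline: takes a document, returns a (possibly new) document.
    interface Stage {
        Document process(Document doc) throws Exception;
    }

    // The pipeline itself is nothing more than an ordered list of stages.
    class Pipeline {
        private final List stages = new ArrayList();

        Pipeline add(Stage stage) {
            stages.add(stage);
            return this;
        }

        Document run(Document doc) throws Exception {
            for (Iterator i = stages.iterator(); i.hasNext();) {
                doc = ((Stage) i.next()).process(doc);
            }
            return doc;
        }
    }

    // Usage: compose whatever the current requirements call for. The stage
    // class names here are hypothetical.
    //
    // Pipeline p = new Pipeline()
    //         .add(new AuditStage())
    //         .add(new SchemaValidationStage())
    //         .add(new DateNormalisationStage());
    // Document result = p.run(incomingDoc);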

I see Clemens Vasters has caught the pipelining bug, and that .NET has had it for a good while now, in no small part due to Tim Ewald - WSE 1.0 supported a kind of in-memory pipeline for SOAP Envelopes; for Java folks it's not a million miles away from servlet filter chains.

Perhaps that's not representative, but it does seem that the .NET crowd gets the pipeline model. I'll go out on a limb here - I suspect that has something to do with MS programming culture being less inured to object orientation and object patterns. A key thing for XML pipelining is that you want to separate data from the process acting on that data, which is heresy in some OO circles. The only process really tied to an XML document is schema validation, and even then the behaviour is so data driven, so late bound, that it's hardly worth picking out. Off the top of my head, I can only think of one OO pattern where it's OK to decouple data from behaviour, and it's the Visitor. It seems that at the system edges, where XML does matter, functional programming and lazy evaluation are the way to go.

The pipeline is the most important pattern/idiom in XML programming. The difference between it and the semantic web outlook is that any good XML hacker knows transformation is also primary stuff, not something to be cast aside as a small matter of programming because the model theories can't support it.

February 21, 2004

HTTP dark matter

It would help to have RDF available to us when defining protocol extensions. Mark Baker.

I used to say that about SOAP ;)

Protocols typically work with a predicate(aka header)/value tuple. If you've ever worked with triples though, you quickly realize the problem with a double; it's not everything you need to know. In this case, you don't know the subject; what has a media type, what is of length 342 bytes, what is identified by this URI? Most protocols, therefore, define their subjects not in the message, but in the specification. For example, here are some possible subjects of an HTTP message. This is fine when you're predefining headers in a spec; there's still a self-descriptive path from bits-on-the-wire to what-is-the-subject-of-this-header. But for extension headers, it doesn't cut it; you've got a predicate (the header name) and an object (the header value), but no subject.

I'm curious to see where Mark's going with this. Like Mark I have a related issue that I can't discuss for various reasons.

In the HTTP case, many of the header tuples are metadata about the representations. But representations are dark matter - they aren't first class objects on the web, since they don't have URIs. Ironically, representations are about as real a thing as you can get on the web (they'll come into your computer if you let them; resources never do that). This 'issue' pops up in RDF circles from time to time. Yet RDF in itself is limited in how it can help with the unnamed parts of web architecture, or anything that doesn't have a URI moniker.

HTML elements with class attributes

Jon Udell: Analyzing blog content

I've heard this way of using CSS described as semantic markup. But I can see an army of RDFers wishing Jon used URIs instead of free text inside his class attributes. I don't know if CSS will take URI syntax as tokens, but WikiWords would be a good compromise.

Web Idiocy (PUT, POST, J2ME, Atom, and reliable messaging)

Reciting chapter and verse of a 12 year old spec? I don't give a flying rats ass what that spec says. Russell Beattie

Back in the day, before we understood the value of standards, that would've been the attitude I'd expect from, say, a large software company ;-)

Anyway...

Russell is rightly annoyed about all this, but he's annoyed at the wrong things and the wrong people, and that's understandable given how we got here. I take the opposite view to Russ; not having mainstream availability of PUT and DELETE is the single most broken aspect of web technology today.

Let's go back. There is a broken spec in this story, but it's not Atom and it's not HTTP. It's, wait for it... HTML. The reason technologies like SOAP, the MIDP and Flash only use those verbs is that HTML only allowed POST and GET in forms. That's where the rot started.

What's the big deal? Well, the hoops you have to go through to do basic messaging on the web are, frankly, ridiculous, and that results directly from inheriting the web forms legacy of abusing POST. For example, consider reliable messaging done over the web. The absence of client infrastructure that supports the full verb complement gives leeway to invent a raft of over-complicated, non-interoperable reliable messaging specs. Reliable messaging, by the way, is one area the WS vendors can't seem to agree to standardize - perhaps that's because it's critical to enterprises (read: there's real money in it). But the point is that there should never have been an opportunity to make things complicated in the first place. In my job, we design and build messaging infrastructure, and a lot of it happens over HTTP. There's a good amount of pressure to make that infrastructure fit with web forms technology and existing client stacks. Now, to do RM over the web, and do it cleanly, you want the full complement of HTTP verbs at your disposal (especially PUT and DELETE). With them you can uniquely name each exchange and use the verbs to create a simple state machine or workflow operating over that exchange. Without them you have to use multiple resources to name one exchange, plus clients and servers will typically have to inspect the URLs to find out what's going on. Operators and software have to be able to manage this, know all the URLs involved in the exchange, plus the private keys you're using to bind them together behind the firewall. Oh, did I mention that you'll have to reinvent these verbs in your content anyway and then get your partners to agree on their meaning? POST-driven exchanges become to a small degree non-scalable, to some degree insecure, and to a large degree hard to manage.
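A rough sketch of what naming each exchange and driving it with the verbs might look like from a plain java.net client - the URL scheme and the PUT/GET/DELETE lifecycle are invented for illustration, not taken from any particular RM spec:

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ExchangeClient {
        public static void main(String[] args) throws Exception {
            // Each exchange gets its own URL; the verbs drive its state machine.
            URL exchange = new URL("https://example.org/exchanges/msg-20040221-001");

            // PUT: create (or idempotently re-create, on retry) the exchange.
            HttpURLConnection put = (HttpURLConnection) exchange.openConnection();
            put.setRequestMethod("PUT");
            put.setDoOutput(true);
            put.setRequestProperty("Content-Type", "application/xml");
            OutputStream out = put.getOutputStream();
            out.write("<message>...</message>".getBytes("UTF-8"));
            out.close();
            System.out.println("PUT -> " + put.getResponseCode());

            // GET: poll the same URL to see how far the exchange has progressed.
            HttpURLConnection get = (HttpURLConnection) exchange.openConnection();
            get.setRequestMethod("GET");
            System.out.println("GET -> " + get.getResponseCode());

            // DELETE: the exchange is complete; tear it down.
            HttpURLConnection del = (HttpURLConnection) exchange.openConnection();
            del.setRequestMethod("DELETE");
            System.out.println("DELETE -> " + del.getResponseCode());
        }
    }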

Trust me, it's not an academic issue, and it's not limited to RM; basic content management is in scope too. For those of you that don't monkey about with HTTP for a living, I can sum up the problem of not having PUT and DELETE like this - imagine dealing with a subset of SQL that doesn't support UPDATE and DELETE, or Java Collections that didn't have an add() method. It's an insanely stupid way of working. But if you never knew SQL had UPDATE to begin with, and how useful it is, perhaps that wouldn't be as apparent.

The irony is that while some of us are left compromising with the fallout from uninformed specs, a number of people think that PUT and DELETE are some sort of esoterica that only a spec wonk could care about. And now, over on the Atom list, some people are talking about workarounds. To heck with that. Get Sun to fix the MIDP and the W3C TAG to fix HTML/XForms. The latter is worth emphasizing - as far as I can tell, this issue isn't even on the TAG's radar.

Russ, sorry; Atom is not the broken spec and the REST folks are not being intransigent nerds (this time). Arguing for a subset of HTTP is not the way to go here, even if it's the expedient way right now for J2ME. Sure, there are hundreds of millions of broken clients out there, but what worries me are the next billion clients, not the early adopters.

February 05, 2004

Javablogs trouble

I've done the usual change-poll-time-and-update-bounce dance, but for some reason my blog feed is not being refreshed by javablogs. The last update seems to be ~Jan 31st. I validated the feed, checked against a few aggregators, and as far as I can tell there's no encoding weirdness in the feed.

Anyone else having trouble? I tried posting to the javablogs forum but got a response code 500 with a stacktrace... oh well, maybe someone can point the Atlassian massive at this entry instead ;)

February 04, 2004

Speed

We could either optimize every last scintilla of performance out of one of those machines or we could get lots of them working on the data in parallel. The former route costs us lots of time and money in terms of labor costs (developers) and capital costs for a small number of top-of-the-range computers. Also, the outcome of the investment in terms of improved throughput is uncertain. The latter route - lots and lots of cheap "throwaway" machines - will cost us a fixed amount of money (low labor costs as we are not optimizing any algorithms) and we can accurately measure the improvements in throughput we expect to see. Sean McGrath

February 01, 2004

Gaming Technorati?

I put links to my weblog categories on the frontpage yesterday. Turns out this lets me inbound link to myself in Technorati each time I add an entry. I'm taking the categories off for the time being.

RSS feed for the Apache WebServicesSpecifications list

RSS 1.0
RSS .91
WebServicesSpecifications

Spec Infected: why programmers hate writing web services

WebServices are mostly implemented in the most modern environments like Java and .NET. Can anyone point me to a C binding for SOAP, or even a more difficult one that's implemented for Cobol? Carlos Perez

No, but would we ever want to expose such things via a direct binding? Legacy systems living deep in the enterprise generally don't seem to require web service interfaces; they require web service gateways with data transformation pipelines that can be dynamically brought into the delivery channels. While I think there are avenues worth exploring, deep integrations aren't yet something that can be push-button automated by tools. But there are ways to get the job done faster and better.

Consider this - exposing a 24x7 web interface onto a Mon-Fri nightly batch COBOL job system. The COBOL system works correctly and is mission critical to the enterprise in question; not something to be tinkered with. Our answer to that scenario was to accept data and queries as async calls over HTTPS, in XML form, encapsulated by a standard XML envelope. The back of the web service is a series of pipelines. The pipelines entail auditing, structural validations, content validations, code mappings, pre-matching, data cleansing, statistical capture and conversion to the job format, before leaving the job in a repository. A second process running on a schedule uploads the jobs to VMS via FTP. A third process collects FTP'd responses from the COBOL systems, reconciles the responses with the submissions and fulfils them back to the sender of the original XML message via another pipeline, as well as publishing the results to other subscribers. The setup has proven flexible, and robust to changes in requirements and semantics.

The idea of blowing out the service from the COBOL - that would be problematic. Herein lies a key issue with the way middleware web services toolsets prefer to be used: codifying a domain model, then generating web service stubs and WSDL descriptors for deployment to the DMZ tier is, in terms of software process, precisely backwards for deep integrations or for repurposing towards service oriented architectures. Neither the tools nor the local object models should be driving the integration process; they should be supporting it. There are some subtle gotchas. Web interfaces and batch processes are working to different timeframes; this entails an asynchronous gateway, but also impacts system administration and operation. There is usually no canonical domain model in an enterprise and, perhaps more importantly, no time or possibility for agreement on one - I see this as a serious issue for efforts like the OMG's MDA and the consequent modelling toolsets coming down the line.

Having built the web service, it would be fine to expose, say, WSDL if someone really wanted it, but this is driven from the service design, not the provisions in some RDB/OO/WSDL mapper. Personally, I would see generating WSDL as being a publication exercise more than a design exercise.


As for the Achilles heel of web services :) I've complained about toolkits, but it's not that (as understanding of enterprise integrations grows, the tools will get better). In early 2004, the Achilles heel of web services is the complexity resulting from the sheer volume and lack of coherence of the web services specs, and a lack of architectural guidance from the folks generating them - hence the title of this blog. Witness the current ASF list.

[Update: Mark Baker laments the passing of the W3C Web Services Architecture group. Me too - there was some comfort to be had in having the likes of Mike Champion, Eric Newcomer, and Frank McCabe thrashing this stuff out.]