" /> Bill de hÓra: August 2007 Archives


August 29, 2007

YouTube via GData

So Google have rolled out a GData API for YouTube. GData is based on Atompub, and having worked on Atom and Atompub for years, I'm feeling good about that, to be honest. "Google and YouTube use it" sounds better in an elevator than "an application-level protocol for publishing and editing Web resources" ;) Now, if we could just get MySpace and Facebook to use it...

A migration guide from the legacy RPC stuff is available. I don't see a Ruby version on the client library page - could that be right?

In other news, you can now embed Google maps using HTML fragments, specifically iframes - hmm. Mark Baker points at GMapEZ as being more declarative. I see GMapEZ as marginally less clunky, due to not using iframes, but it does need a small hack. It's easy to be critical here; however, HTML isn't exactly awash with clean options for embedded content.

August 28, 2007

XMPP matters

Tim Bray:

"There are two problems with Push that don’t arise in Pull: First, what happens when the other end gets overrun, and you don’t want to lose things? And second, what happens when all of a sudden you have a huge number of clients wanting to be pushed to? This is important, because Push is what you’d like in a lot of apps. So it doesn’t seem that important to me whether you push with XMPP or something"

Overrun clients. Unfortunately, this does arise in pull, which is why I brought up feed archive paging and the rate of publish-to-pull, which becomes critical as the gap widens or when the client is downed for a while. Operationally, it's the same effect as queues overrunning in push scenarios - messages are dropped on the floor. At that point, you don't care about how traffic flow is initialized any more.

Huge number of clients. This is a job for XEP-0060 Publish-Subscribe using a scalable server such as ejabberd. PubSub subscriptions are to nodes, which are logically named; that means you can run a server farm to back subscription services if you want to. And you can build a lattice of subscription nodes if EDA is your thing.
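
For a flavour of what subscribing to a node looks like from client code, here's a minimal sketch against the Smack library's XEP-0060 support. The server address, node name and credentials are made up, and the pubsub classes vary between Smack versions, so treat this as illustrative rather than definitive:

    import org.jivesoftware.smack.XMPPConnection;
    import org.jivesoftware.smackx.pubsub.ItemPublishEvent;
    import org.jivesoftware.smackx.pubsub.LeafNode;
    import org.jivesoftware.smackx.pubsub.PubSubManager;
    import org.jivesoftware.smackx.pubsub.listener.ItemEventListener;

    // Subscribe to a logically named pubsub node; the node name stays stable
    // even if a server farm sits behind the pubsub service.
    public class PubSubSubscriber {
        public static void main(String[] args) throws Exception {
            XMPPConnection connection = new XMPPConnection("example.org");
            connection.connect();
            connection.login("client", "secret");

            // "pubsub.example.org" is the pubsub service, "events" the node
            PubSubManager manager =
                new PubSubManager(connection, "pubsub.example.org");
            LeafNode node = (LeafNode) manager.getNode("events");

            node.addItemEventListener(new ItemEventListener() {
                public void handlePublishedItems(ItemPublishEvent event) {
                    System.out.println(event.getItems()); // pushed, not polled
                }
            });
            node.subscribe(connection.getUser());
        }
    }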

I'd take XMPP as a messaging backbone over AMQP, which Tim mentions because he's thinking about buffering message backlog, something that AMQP calls out specifically. And as he said - "I’ve been expecting message-queuing to heat up since late 2003." AMQP is a complicated, single-purpose protocol, and history suggests that simple, general-purpose protocols get bent to fit, and win out. Tim knows this trend. So here's my current thinking, which hasn't changed in a long while - long haul, over-Internet MQ will happen over XMPP or Atompub. That said, AMQP is probably better than the WS-* canon of RM specs; and you can probably bind it to XMPP or adopt its best ideas. Plus, I'm all for using bugtrackers :)

So I think XMPP matters insofar as these issues are touched on, either in the core design or in supporting XEPs, or reading protocol tea leaves. XMPP's potential is huge; I suspect it will crush everything in sight just like HTTP did.

A really interesting long bet is to wonder how much voice traffic will be done over XMPP. Who knows what's possible with XMPP/Jingle servers running on something like Tamarin or Gaim. Jingle is not well known - essentially it's an extension to XMPP that uses a P2P connection model and, failing that, supports negotiating NAT traversal, STUN, or, if all else fails, server relays.


Ant+Ivy

Howard Lewis Ship:

What I love about Maven is dependency management, including the transitive dependencies and all the related scopes.

I also like the (theoretically) integrated documentation. I like the APT format for getting docs written without getting hung up on formatting & etc.

I like building at the project or at the module level.

However, in practice, Maven is painful.

The comments are also worth reading. I've always held that Maven is a good idea, let down by its implementation.

I suspect many people have a bad impression of Ant from the pre-1.6 days and having to do hacks like entity inclusion to manage large projects. Specifically the pain points were subprojects and jar dependencies. This resulted in evaluating/using Maven or a home-rolled system. But today, Ant's support for big projects is really quite good, better than Maven's. And Ant is nothing if not solid. If you want to see what a modern Ant based build system can do, get Steve Loughran's book, Ant in Action, or go to antbook.org. I don't see a use for Maven or its conceptual weight, given Ant 1.7+Ivy.

SPOF

Assaf: "The exact reason is what we often call 'single point of failure'.

I agree. WGA appears to be fragile. But that's the point; people who deploy systems like that want to be able to remotely deny service, for what they see as good intent. It doesn't always occur to them that such a design might be used for a different intent, or simply executed in error. You really have to have your thinking cap on when applying that approach. Another nearby incident is Google's termination of its video service. I recall RSS-based systems stopped working when Netscape pulled a DTD a few years back.

I wonder how many of these incidents need to happen before people reconsider. Key word: centralization.

Assaf picked up on OpenID as another SPOF. So let's hijack the conversation. I would also point at any markup technology that is based on mustUnderstand as being fragile. Why? Because you are beholden to the publisher's opinion on how the data is to be processed. That's only suitable for a centralized system design that happens to run distributed. This is one often overlooked reason why SOAP+WS can't ever be a basis for Internet scale, but Atom+Atompub/XMPP already is. SOAP's mustUnderstand flag induces exactly that fragility. Of course that was never the intent of mU - it was introduced to stop programs doing local damage. Whereas Atom's foreign markup constraint is designed to induce overall robustness, at the risk of point-to-point exchanges continuing to work when they shouldn't.

August 24, 2007

links for 2007-08-24

August 23, 2007

links for 2007-08-23

August 22, 2007

links for 2007-08-22

August 21, 2007

Old and busted, new and busted

via Mike Herrick, sprach Erik Onnen:

"The problem is further compounded by the fact that Flex doesn't expose any of the raw response data to us, simply a FaultEvent with little useful information (one of the dreaded stream errors actually). No headers, no body content, nothing. I've seen blog posts that hint at this being a limitation of the FVM's status as a lowest common denominator runtime, Flex included - some browsers don't expose headers to plug-ins so neither can the FVM."

That's so broken it's not even wrong. It's 2007; to hell with bad web clients.

links for 2007-08-21

August 20, 2007

Rates of decay: XMPP Push, HTTP Pull

Mike Herrick:

"Clearly Atom can't replace ALL pub/sub use cases, but for every day integration architecture where you want business events / EDA why can't we use Atom feeds? In an extreme case, you might have an event sink requesting the feed every 10 seconds - in most cases every 10 minutes would likely be fine?

Who is doing this today? Any lessons from the trenches?"

One thing. The sliding window. If the publisher is publishing entries faster than the clients are reading them, and the publisher is limiting the number of entries in the feed at any given time, then clients are prone to missing entries. Actually you don't need a fast publisher; a downed client will do.

What you really don't want to do in that case is temporal coupling - synchronizing publishing and polling rates so that there are no window misses. It stops independent evolution.

The second-last time I had to solve this kind of integration problem (2004), events were going to be arriving at a fair clip, and the sliding window is why my colleague and I settled on XMPP as the application protocol.

Another thing to note is that with XMPP, you won't need an Atom feed; pushing individual Atom entries is all you have to do. This led me to conclude that the "feed" is a construct specific to HTTP, or even to client/server polling in general. I am hence glad that Atom allows an Entry to be its own document.

Now that I know about feed paging, I would reconsider using HTTP - though XMPP is fast, and has interesting integration characteristics, such as event routing to ad-hoc listeners. Feed paging allows you to walk back through a feed archive, going from the "current" feed URL to a succession of "previous" feeds.

In that case, if your client goes offline, when it comes back up it pulls down the current feed. It can then check the "previous" URL and, if it doesn't recognize it, pull it down and continue back-paging until the previous is known. The client then resets its "previous" and can save, from the unknown previous feeds, the entries it doesn't have (each Atom entry has a unique id). Clients do have to keep pager state, and probably a decent number of previously fetched entry URLs.
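
To make that concrete, here's a minimal sketch of the catch-up walk. Feed, Entry, fetchFeed() and process() are hypothetical stand-ins for whatever feed toolkit you use:

    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.Set;

    // A sketch of the back-paging catch-up described above.
    abstract class FeedCatchUp {
        interface Entry { String id(); }
        interface Feed {
            String previousUrl();       // the "previous" archive link, null at the start
            Iterable<Entry> entries();
        }

        abstract Feed fetchFeed(String url);  // HTTP GET + parse
        abstract void process(Entry entry);   // hand the entry to the application

        private final Set<String> seenEntryIds = new HashSet<String>();
        private final Set<String> seenFeedUrls = new HashSet<String>();

        void catchUp(String currentFeedUrl) {
            LinkedList<Feed> backlog = new LinkedList<Feed>();
            String url = currentFeedUrl;
            // walk back through "previous" links until we reach a feed we know
            while (url != null && !seenFeedUrls.contains(url)) {
                Feed feed = fetchFeed(url);
                backlog.addFirst(feed);       // oldest feed ends up first
                seenFeedUrls.add(url);
                url = feed.previousUrl();
            }
            // replay in publish order, keeping only entries we don't have;
            // each Atom entry's unique id makes the de-duplication trivial
            for (Feed feed : backlog) {
                for (Entry e : feed.entries()) {
                    if (seenEntryIds.add(e.id())) {
                        process(e);
                    }
                }
            }
        }
    }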

Feed paging is likely to become an Internet Standard.

Consequently, the last time I solved this problem, I chose feed paging; the publish rate was also lower. The design was liked, but not taken up for scheduling reasons. The alternative design I and another colleague gave was to reverse the data flow to a push style - POST events to a URL (it was not Atompub, but the endpoint did have to return a URL). The work estimate for the POST solution came in lower than implementing a feed pager and archiving feeds according to that model. There was another reason for the pager approach; it was deemed easier to secure GET to a feed URL than POST to an endpoint.

Either way, XMPP and HTTP and Atom are all you need for this class of problem.

If I were a bookie, I'd close bets and pay out Sam's long bet as far as XMPP goes.

links for 2007-08-20

Presentation Logic and Django Templates

On Django templates: "Any designers I have worked with would be completely overwhelmed."

I can't agree. Short of providing a CMS with a full-blown editing environment (and remembering that the self-serving nature of tool support means the template language will become complicated), Django's templates are just the kind of thing you need to give a non-programmer. They successfully split presentation boilerplate from content.

Sometimes you have logic that is presentation related, and the "no code" rule means contorting and optimizing your model or your other logic for that presentation need. Presentation logic is a real requirement and it has to go somewhere. The most important architectural precept to know about presentational logic is that there is a one-to-many relationship between it and application logic. A single web-based system will often need to support multiple presentational needs. The traditional MVC and 3-tier diagrams you see tend to miss this point entirely, as they usually show only one presentational layer and one application layer - in reality there are many presentational layers. You really don't want that stuff embedded in your domain.

A good templating design means you have the minimum amount of distraction and irrelevance when it comes to adding new content, but in a way where presentational issues are not pushed back into the code. Django templates provide enough control structures to display content - expressive enough for presentations but not for models or business programming. This approach of explicitly limiting a language is called Least Power.


Originally written on 2006-05-11; still seems relevant

REST design notes

Brian Repko's Blog: Ain't gettin' no rest with REST:

"One really bad way to do it is to leave it up to the client. For a Void, the client does a GET, changes state to void and then POSTs the updated version. For a Reverse, the client does a GET, changes state to REVERSE, creates the reversal, PUTs the reversal and POSTs the original with its state changed. Ok, that clearly will not work so we know it has to be done on the server, so what URI represents this method call. I got a couple of ways to do it and am looking for thoughts on these designs."

If you use GET to obtain the state of the resource, you subject yourself to the whims of caches and proxies on the chain. POST at least brings your request back to the origin server (or your web farm). GET might not proceed further than a browser cache.

"The first way to do this is to make URIs for the methods as well as the properties of your domain. So if you have a "void()" or "reverse()" method, then you could have the following URIs that you would POST to."

Designing to REST says you should send the state of the resource to the server, not ping resource proxies. Orthodox web applications would tend to use PUT.

The fundamental problem here is looking for a RESTful way to do RPC/RMI, instead of redesigning the application to expose resources. In RPC, you send the name of the operation, which is domain specific, either to a service/controller, or perhaps to the actual object you want to affect. The REST way is to send the state of the resource using a fixed set of methods - the methods are all your application can use to affect or determine state. These approaches are not easily reconciled, as has been demonstrated over and over, notably with XML-RPC and early SOAP RPC-encoded styles. That's because the RPC style over HTTP means using HTTP methods as tunnels for the domain-specific calls. This leads to modeling and design issues, but also affects scalability and security. No intermediate HTTP-aware system understands your privately defined function calls, or their caching or security implications.

Typically POST is used for RPC. This can lead people to erroneously conclude that POST is the prime method in HTTP, going as far as standardizing around it. Two, maybe three generations of object oriented programmers have walked into that trap. Don't do that!

If hiding state is a key design constraint for you (often the case in OO designs), the idea of exposing state using representations might feel broken. Good REST designs as they appear in HTTP are all about encapsulating implementation - both the physical details and the private application details (like "void()") that aren't assured to globally interoperate. Encapsulating state is not really a design goal.

An exception to all this resource orientation might be things that are truly 'services' - activity-based matters, like searching, imaging, geolocating. In those cases it can be useful to model the call as being into a coordinate space and not just a private method taking arguments.

"The last thing - and probably the topic of the next entry is the need for meta-data for the representation - and in particular how to do versioning for optimistic locking operations - the classic read-read-write-write problem."

For this, look at how ETags function. And watch what happens around Atompub as it gets used for content management systems. Also, how Subversion implements transactional commits over WebDAV is instructive.
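
For illustration, here's a sketch of the read-then-conditional-write flow with ETags; the URL and payload are made up, and it assumes the server issues ETags:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Optimistic locking over HTTP: GET the resource, note its ETag, and make
    // the PUT conditional with If-Match to catch the read-read-write-write race.
    public class EtagUpdate {
        public static void main(String[] args) throws IOException {
            URL url = new URL("http://example.org/document/xyz");

            // read: fetch the current state and remember its ETag
            HttpURLConnection get = (HttpURLConnection) url.openConnection();
            String etag = get.getHeaderField("ETag");
            get.getInputStream().close();

            // write: PUT the new state, conditional on the ETag we read
            HttpURLConnection put = (HttpURLConnection) url.openConnection();
            put.setRequestMethod("PUT");
            put.setDoOutput(true);
            put.setRequestProperty("If-Match", etag);
            put.setRequestProperty("Content-Type", "application/xml");
            OutputStream out = put.getOutputStream();
            out.write("<document>new state</document>".getBytes("UTF-8"));
            out.close();

            if (put.getResponseCode() == 412) {
                // 412 Precondition Failed: someone else wrote first; re-fetch
                // and retry instead of clobbering their update
            }
        }
    }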


The above was written on 2006-08-07. Since then I've seen some posts that suggest the REST style is only suitable for simple things, but not non-simple things like transactions; for that you need real systems design. Maybe, but I believe some (not all) of these positions are about pre-assuming a solution. When you supply examples (like Subversion commits), "simple" appears to get redefined. My suspicion is that "non-simple" here is like the old saw about "artificial intelligence" - once you solve an AI problem, it ceases to be an artificial intelligence problem.

August 19, 2007

Buddycrap

Moar grafz plz

Brad Fitzpatrick gets with the program:

"There are an increasing number of new "social applications" as well as traditional application which either require the "social graph" or that could provide better value to users by utilizing information in the social graph. What I mean by "social graph" is a the global mapping of everybody and how they're related, as Wikipedia describes and I talk about in more detail later. Unfortunately, there doesn't exist a single social graph (or even multiple which interoperate) that's comprehensive and decentralized. Rather, there exists hundreds of disperse social graphs, most of dubious quality and many of them walled gardens."

Dare Obasanjo says it's much harder than that. Because "as they chase the utopia that is a unified social graph" there will be 3 problems:

  • Some Parts of the Graph are Private
  • Inadvertent Information Disclosure caused by Linking Nodes Across Social Networks
  • All "Friends" aren't Created Equal

Yes, yes and yes. But, so what, you say? Is that a good enough reason to keep this data siloed? It's convenient to ignore that MSFT almost certainly have lab and production results that back these issues up.

Let's call the problem of lossy context as you move your contact details around "buddycrap". It sounds better than "distributed groupware", which we'll see in a minute is another way of looking at this. Dealing with buddycrap fundamentally requires provenance - from where and in what context did this buddy come?

Groupware Bad. Money Ok.

Saying things like "not all friends are created equal" leads you to want to introduce permissions control and visibility to online contacts - the "solution". It's a problem you see a lot in groupware - for example HR and your manager might be able to see things about you that your colleagues can't. It's tempting then to carry that logical form to online systems - my d00dz should be able to see things my dad can't. Dare's perceived issues are food and drink to enterprise IT consultants.

Since deployed groupware tends to be abominable, you'd think pointing at the social graph aggregation problem, saying "that's groupware" and linking to what Jamie Zawinski said about groupware would be a killer putdown. It might be.

Aggregation is also about money. Social graph aggregation and fluidity allows for better cross-selling. All those recommendation algorithms of the form "you like/bought x and y likes/bought x, you and y might have something in common" work better with larger data sets. Especially if you can jump verticals - such as connecting last.fm data to Facebook. So it's gonna happen one way or another. I expect this is why Google is keen to figure this out, and soon - they can win big if they can access and quantify over all that data; it's in their DNA to do so.

On balance I think I'd prefer targeted spam and online clubcards to a data silo any day. So I'm leaning to the idea that access to this data should be uniform and fluid, even if I lose a lot of context. Which is to say, even though I accept Dare's issues, I'll take an 80% solution from Fitzpatrick. That said, I noted that "consumer analytics" didn't appear in Fitzpatrick's goals - or non-goals.

Once again, the semantic web has easily ignorable answers.

Danny picks up on this. True, the semantic web community have been talking about social graphs for years (I won't be surprised if that community is where the term "social graph" comes from). DOAP's been around forever, it seems. Will this be another area that takes a pass on semweb technology like RDF?

As well as distributed data graphs, which frankly are semweb bread and butter, the semweb community also has most of what's needed to allow assertions of the form "only people in class x can see item y" that can travel across systems, assuming three things. First, that there are now working solutions for sourcing where a 'triple' (the smallest useful RDF fact) came from, something that was a semweb hot topic a few years back. Second, that you can make such assertions usable to consumers - consider that even domain experts who are paid to do so get permission schemes terribly muddled. Third, that you're actually crazy enough to want to bring groupware-style permissions to the Web.

Freedom 0. Think of the dwarves.

Here's what I said the last time Mark Pilgrim brought up Freedom 0 - "good luck trying to figure out who owns generated metadata, or anything that was mined. GPLv2 'linking' is straightforward by comparison."

Social graph data is exactly like that. A lot of it is computed; arguably you don't own it at all beyond your account details.

But, on reflection this social graph thing is only part of the picture - we should also be concerned about video game data. Because it's an important freedom to be able to let game characters and avatars switch worlds. A lot of people think video games are more important than social networks, and rightly so.

Forget interchange across social networks. What you want to be worrying about is why your WOW dwarf can't run amok in Vice City, once he's adapted to urban life. Why can't Club Penguins roam free in Second Life? Why can't Solid Snake merc it in Marioworld? Arbitrary restrictions like this are broken. Your game avatars are part of your online psyche, no? Game characters locked into a single game or world are clearly a looming crisis of personal freedoms. Universe jumping happens on TV shows and in comics all the time - this is what we *expect* to happen. That's because TV shows and comics are to video games as the real world is to social networks.

Walled garden video game universes suck, and should be routed around. People are getting sick of registering and re-building characters on every game; but also, levelling up every time is too much work. Plus, it would be cool to have a network of network games - like the internet, but fun.

I'll have more to say on video games and freedom 0 in future posts.

August 16, 2007

links for 2007-08-16

August 15, 2007

Servlets+JSP design answers

Earlier I asked: "I’m wondering how would one produce a URL space for a blog style archive, using Servlets+JSP, and do so in a way that isn’t a CGI/RPC explicit call?"

I got some answers, all good:

Sam Ruby: "Perhaps via URL Rewrite Filters?" and pointed at a Servlet Filter that does URL rewriting, called UrlRewriterFilter.

Stefan Tilkov (who started all this ;) suggested decomposing the URL in the Servlet:

    // do something fancy with the path, like decomposing it into 
    // parts, retrieving the entry from the DB, creating an entry object,
    // and setting it as a request attribute
    // simulated here :-)
and said "any decent Java programmer could not help but create at least a small library to help with this, but that’s a far cry from a full-featured open source over-hyped Web framework".

Michael Strasser: "/st/2007/08/15/java_web_frameworks.html: This string is available to the servlet using request.getPathInfo(). The servlet tokenises the string and works with the information."

I would encourage anyone to think for a bit about how that parsing/mapping might work. And then take a look at this: django URL dispatching.
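
For the curious, a minimal sketch of that tokenising, assuming a servlet mapped to /blog/* and a hypothetical lookupEntry():

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Tokenises getPathInfo() to pull year/month/day/slug out of a URL
    // like /blog/2007/08/15/java_web_frameworks
    public class BlogEntryServlet extends HttpServlet {
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            String path = request.getPathInfo();  // "/2007/08/15/java_web_frameworks"
            String[] parts = (path == null)
                ? new String[0] : path.substring(1).split("/");
            if (parts.length != 4) {
                response.sendError(HttpServletResponse.SC_NOT_FOUND);
                return;
            }
            // parts = year, month, day, slug: look the entry up, set it as a
            // request attribute, and forward to a JSP to render it
            request.setAttribute("entry",
                lookupEntry(parts[0], parts[1], parts[2], parts[3]));
            request.getRequestDispatcher("/WEB-INF/pages/entry.jsp")
                   .forward(request, response);
        }

        private Object lookupEntry(String year, String month, String day, String slug) {
            return null;  // hypothetical: retrieve the entry from the DB
        }
    }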

Hugh Winkler: "It is a micro framework. You subclass RestServlet and declare some URL patterns to match, and handlers for them. The base class parses the URI, sets attributes in the ServletRequest object based on the URI pattern, and invokes your handlers.". Hugh is as smart as paint; if you follow the link, he gets into a good bit of detail.

There are three interesting conclusions for me. First, the Servlets spec can't support what Harry Fuecks calls "Parameterized REST URLs", and strictly speaking, you need more than Servlets+JSP. Second, that "more" seems to be a way to declare mappings, since ultimately you'll want to be able to state what URL maps to what JSP/code outside Servlet code. Third, the API surface area to support this kind of feature seems to be very manageable. To be honest, I was worried I was missing something, but that's what I was thinking when I wrote the post - it can't be "done" without introducing a "micro" or "full" framework for URL mapping support. A Servlets+JSP setup isn't powerful enough to quantify over the kind of URL spaces that are becoming normal (sets of resources). The Servlets spec is designed to let you quantify over scripts (showing that CGI is foundational to Servlets). Web.xml lets you match a pattern to a servlet, and the matching ('url-pattern') is basic:

  /blog
  /blog/entry*
  /*
  *.html
  *.jsp

that sort of thing. JSPs by default in Tomcat map to a single servlet and what you can do there is pass query params. By the way, Resin supports Perl5 regex mappings in web.xml but it's a non-standard extension.

Sam placed some long bets recently. Here are some short ones, say two years. First, web frameworks will come to adopt declarative mapping files, either regex-based a la Django and Spring MVC, or URI Template based. Second, JSR311 will be a foundation technology for Java based web systems - it supports URI Templates. Incidentally, the URL the JCP gives out is http://jcp.org/en/jsr/detail?id=311, but never mind. Paul Sandoz has started the RI for JSR311, called Jersey. Things start to look like this:

    // The Java class will be hosted at the URI path "/helloworld"
    @UriTemplate("/helloworld")
    public class HelloWorldResource {

        // The Java method will process HTTP GET requests
        @HttpMethod("GET")
        // The Java method will produce content identified by the MIME Media
        // type "text/plain"
        @ProduceMime("text/plain")
        public String getClichedMessage() {
            // Return some cliched textual content
            return "Hello World";
        }
    }

which fits in nicely with modern POJO/annotation style development. I can't wait to see this kind of thing integrated into Servlet containers.


Servlets+JSP design question

Stefan Tilkov: "Personally, though, I believe I’d skip using a framework altogether, and go with JSPs and one or more hand-coded Servlets — I’ve not seen Java Web Frameworks yet that do not attempt to hide the Web from me."

This is an interesting-sounding "back to basics" notion from Stefan. However, it's easy to forget that Servlets were Java's response to CGI, way back when. Here's the link for Stefan's entry:

 http://www.innoq.com/blog/st/2007/08/15/java_web_frameworks.html

I'm wondering how one would produce a URL space for a blog-style archive using Servlets+JSP, and do so in a way that isn't a CGI/RPC explicit call. That is, the URLs don't end up like this:

 http://www.innoq.com/blog/entry.jsp?id=java_web_frameworks

with one constraint - "just a servlet" that pulls java_web_frameworks.html directly from a "2007/08/15" folder on the filesystem and bypasses JSP is out. All the response is to be generated via JSP. Would we need to create a framework, however 'micro'?

A counterfactual question to ask next is: if we didn't start out with a URL space in mind, would Servlets+JSP naturally lead us to a CGI-style call or not?

August 14, 2007

links for 2007-08-14

Phat Data

Data trumps processing

For the last few years, I've been hearing that multicore will change everything. Everything. The programming world will be turned on its head because we can't piggyback on faster chipsets. The hardware guys have called time on sloppy programming. We never had it so good.

We're doomed apparently.

I think that increased data volumes will impact day to day programing work far more than multicore will. A constant theme in the work I've done in the last few years has been dealing with larger and larger datasets. What Joe Gregorio calls "Megadata" (but now wishes he didn't). Large data sets are no longer esoteric concerns for a few big companies, but are becoming commonplace.

The use of RDBMSes as data backbones has to be rethought under these volumes; as a result, system designs and programming toolchains will be altered. When the likes of Adam Bosworth, Mike Stonebraker, Pat Helland and Werner Vogels are saying as much, it behooves us to listen.

MyDataCenter.com

The first PC I bought had a 750MB disk (I paid extra for it). One of my favorite tech books is Managing Gigabytes, which was published in 1999. Back then gigabytes were a big deal. My laptop of a few years later, which my daughter uses today, had, count 'em, *20GB* of disk. Today I have 120GB of USB storage strapped on the back of my 60GB T42p with velcro. Some time this week my new laptop with a 120GB disk will arrive and I'm already sniffing about for a 160GB USB drive, or maybe I'll strap on another 120GB unit. I have a terabyte of storage around the house.

I find it's all very hard to manage, and filling that disk space is no problem.

In less than a decade the cost of data storage has fallen through the floor, but more importantly the amount of data to store has exploded. I don't have numbers, but I suspect the world's accessible electronic data is growing at a faster rate than clock cycles, available bandwidth, or disk seek time. If there's going to be another edition of Managing Gigabytes they'll have to skip a scale and call it "Managing Petabytes". Managing Gigabytes just about covers personal use these days. Our ability to generate data, especially semi-structured data, appears to be limitless.


Data Physics

The CAP theorem (Consistency, Availability, Partition tolerance: pick two) suggests that you can't have your data cake and eat it. Every web 2.0 scaling war story I've heard indicates RDBMS access becomes the fundamental design challenge. Google famously seem to be able to scale precisely because they don't rely on relational databases across the board. People experienced with large datasets say things like joins, 3NF, triggers, and integrity constraints have to go - in other words, key features of RDBMSes, the very reasons you'd decide to use one, get in the way. The RDBMS is reduced to an indexed filesystem.

Is this crazy talk? Maybe. Good luck explaining to data professionals and system architects that centralised relational databases are not the right place to start anymore. They work really well. There is a ridiculous amount of infrastructure and expertise invested in RDBMSes. Billions of dollars. Man-decades. Think of what you get - data integrity, query support, ORM, ACID, well-understood replication and redundancy models, deep engineering knowledge. Heads nodding in agreement at your system design. Websites in 15 minutes on high-productivity frameworks. Java Enterprise Edition. You'd seem to be crazy to give that up for map-reduce jobs, tuple models, and tablestores that can't even do joins, never mind that there's zero support for object mapping or constraints. It's no small ask to let go of these features. Psychologically, the really hard part seems to be giving up on consistency. The idea of inconsistent data *by design* is an odd-sounding thing to be pitching, no matter how many records you're talking about. You're in danger of sounding irresponsible or idiotic. But if CAP holds, and you have to distribute the data to deal with volumes, and want to make that data available, consistency takes a bath.

Data as a service

The usual next step when a database approach isn't cutting it is moving files out to a SAN, probably with the metadata and access control in the RDBMS, so you can retain some of your toolchain. SANs will become very popular as they come down in price, but a SAN only solves the remote part of storage. Ultimately you'll need a distributed filesystem that allows data access to be logically untethered from block storage and mount points. The big volumes mean you need to be able to write data and not care where it went. And you need keyed lookup for reads built on top of the FS, not in the RDBMS (on the basis that an RDBMS with no joins, constraints or triggers is an indexed filesystem). That will end up looking something like Hadoop, MogileFS or S3 - a data parallel architecture.

On the other hand, if data needs to be distributed because there's so much of it, and managing a lot of data is consequently difficult, but not core to most business or personal operations, a data grid is a potentially huge utility market to be part of.

August 11, 2007

Web resource mapping criteria for frameworks

Here's an axis for evaluating a web framework - how well does it do resource mapping?

Updating web resources

Let's take editing some resource, like a document, and let's look at browsers and HTML forms in particular, which don't do a good job of allowing you to cleanly affect resource state. What you would like to do in this suboptimal environment is provide an "edit-uri" of some kind. There are basically 5 options for this; here they are, going from most to least desirable:

  1. Uniform method. Alter the state by sending a PUT to the document's URL. The edit-uri is the resource URL. URL format: http://example.org/document/xyz
  2. Function passing. Allow the document resource to accept a function as an argument. URL format: http://example.org/document/xyz?f=edit
  3. Surrogate. Create another resource that will accept edits on behalf of the document. URL format: http://example.org/document/xyz/edit
  4. CGI/RPC explicit: send a POST to an "edit-document" script passing the id of the document as an argument. URL format: http://example.org/edit-document?id=xyz
  5. CGI/RPC stateful: send a POST to an "edit-document" script and fetch the id of the document from server state, or a cookie. URL format: http://example.org/edit-document

The first option, uniform method, isn't really available to us, unless we intercept all POST HTML form actions with javascript and map them to PUT.

The second option, function passing, we can have if we can name a resource and the code handler for that resource can in turn handle parameters on a POST request (or in the POST body). We have to be careful that GET and HEAD don't result in edits, but typically they'll result in the form being served. Visually, it's nice and clean; we've identified an action surrogate URL that you can use with web technology which doesn't support the full complement of HTTP methods but will send via POST. With function passing, the resources are still first-class citizens, not the methods.

The third option, surrogate, we can have if we can pick apart the URL so we know it's an edit operation on a specific document. We have to be careful that GET and HEAD don't result in edits, but typically they'll result in the form being served. Even though we've increased our URI space as a function of functions, this will do fine as a way to let HTML clients POST new state.
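
Here's a minimal sketch of how a single servlet might support both the function passing and surrogate options; the /document/* mapping and applyEdit() are assumptions for illustration:

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Accepts edits by function passing (POST /document/xyz?f=edit) or via a
    // surrogate (POST /document/xyz/edit). GET serves the form; only POST
    // changes state.
    public class DocumentServlet extends HttpServlet {
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            // never edit on GET/HEAD; serve the document or its edit form
            request.getRequestDispatcher("/WEB-INF/pages/document.jsp")
                   .forward(request, response);
        }

        protected void doPost(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            String path = request.getPathInfo();  // "/xyz" or "/xyz/edit"
            if (path == null) {
                response.sendError(HttpServletResponse.SC_BAD_REQUEST);
                return;
            }
            boolean surrogate = path.endsWith("/edit");
            boolean functionPassing = "edit".equals(request.getParameter("f"));
            if (!surrogate && !functionPassing) {
                response.sendError(HttpServletResponse.SC_BAD_REQUEST);
                return;
            }
            String docId = surrogate
                ? path.substring(1, path.length() - "/edit".length())
                : path.substring(1);
            applyEdit(docId, request);  // update the document's state
            response.sendRedirect(request.getContextPath() + "/document/" + docId);
        }

        private void applyEdit(String docId, HttpServletRequest request) {
            // hypothetical: read form fields and persist the new state
        }
    }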

The fourth option, CGI/RPC explicit, names the method instead of the resource. It's non-uniform RPC. You also see web APIs follow this pattern. Sometimes that's because Web APIs get built out separately, so the API functions get hived off from the actual resources and it can seem hard to see how to integrate a web site with a bolted-on publishing engine. I suspect it's often because people think Web APIs are supposed to be like actual APIs.

The last option, CGI/RPC stateful, is non-scalable and violates any number of design tenets; we won't discuss it any further here. It's rare enough anyway (unless you're doing SOAP RPC-Encoded, but then I can't help you).

Surrogates and function passing solve in particular the problem of HTML forms technology, namely that it subsets the available HTTP methods. The intent of either is to provide a sane workaround for clients that don't or aren't allowed to support PUT/DELETE directly to resources because HTML itself doesn't allow PUT/DELETE in forms and a lot of deployed web infrastructure has crystallised around that. Both are superior to the CGI style, and light years away from gateway/rpc programming in terms of intent.

Naming

There's an important point and it's this - you can only support uniform, surrogate and function passing styles if you can supply all of your documents with URLs. The two CGI options only require that you can name CGI scripts. The rest of this post is about how important it is that your framework supports that per-resource naming feature.

I've noticed that, according to some people, function passing and surrogate techniques are architecturally the same as the CGI explicit technique. So these,

 http://example.org/document/xyz?f=edit
 http://example.org/document/xyz/edit

present the same design value as this,

 http://example.org/edit-document?id=xyz 

because it's all just URLs. And you don't get hung up on URLs. They're opaque and the form of them is overrated. Right? At some carpet-bombing level of web architecture, that might be so. Other places where you might get away with that position are weblog comments, technical op-eds, or a drunken party where no-one's really paying attention. Otherwise, no. The CGI option is deficient. It promotes a non-uniform method to a resource at the expense of leaving your actual resources unnamed. Your world of resources, that is. Think of it this way - if I made every domain object in your Java middleware non-accessible, and all CRUD ops had to be done by passing HashMaps into Manager or Container classes, you'd scream bloody murder.

Framework restrictions

I suspect people do this CGI/RPC thing because their framework makes them do it, by making resource naming a PITA. All they have, logically speaking, is CGI, or raw Servlets/ASP.

Older frameworks on the whole do a poor job of resource mapping. That's because most of these frameworks derive from CGI. What's CGI anyway? CGI is a gateway technology to hook "not web" stuff into the web. CGI was never for REST-centric identification of resources; that's why it's a gateway interface. CGI is a hole left in web architecture, for web dark matter. You know how in Stargate SG-1 travel to other planets using a wormhole portal left by the Ancients and kick weekly ass? A CGI script is like that wormhole.

Let's look at that CGI URL again:

 http://example.org/edit-document?id=xyz

The *actual* resource name is

 http://example.org/edit-document

anything after a '?' in a URL is an argument for the resource - in this case a key for a document in our domain. What's going on here is that the function has been promoted to a resource. It's now a first-class thing in our web application. The resource is the script, not the Document. That's a fundamental shift away from REST-centric development back towards RPC style.

If you want to do REST-centric programming in a CGI derived framework, you'll have to emulate resource identification by hacking around common gateways. That's like using a Stargate to pass through *everything*. Getting into your car would involve a Stargate. Going to the bathroom? Use a Stargate. That's what programmers call pointless indirection. If your default programming model for the web is like CGI (Stargates for the Web), then the second you want to go REST-centric you'll be way off in terms of your solution space, because you shouldn't have started "from there" in the problem space. The right problem to start with is "how do I program to a world of resources?".

Resource mapping checklist

The first thing you have to be able to do to program a world of resources is name them. That way they can all have URLs. Then you'll want to map a finite amount of handler code onto requests against them. What you want from a web framework in terms of resource mapping is a language expressive enough to do the following:

  1. Quantify over the set of your resource names
  2. Match subsets of your resource names to particular code
  3. Allow passing of named functions into resource URLs
  4. Allow deconstruction of URLs to determine an actual resource from a surrogate.
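
As a toy illustration of what the checklist asks for, here's a regex-based mapping table in the Django/Spring MVC style; Handler and the route patterns are made up:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // A table of regex patterns over the resource name space, each mapped to
    // a handler; capture groups deconstruct a URL into its resource parts.
    public class ResourceMapper {
        public interface Handler {
            void handle(String[] uriParts);  // the URL's capture groups
        }

        // insertion order matters: first match wins
        private final Map<Pattern, Handler> routes =
            new LinkedHashMap<Pattern, Handler>();

        public void map(String regex, Handler handler) {
            routes.put(Pattern.compile(regex), handler);
        }

        public boolean dispatch(String path) {
            for (Map.Entry<Pattern, Handler> route : routes.entrySet()) {
                Matcher m = route.getKey().matcher(path);
                if (m.matches()) {
                    String[] parts = new String[m.groupCount()];
                    for (int i = 0; i < parts.length; i++) {
                        parts[i] = m.group(i + 1);
                    }
                    route.getValue().handle(parts);
                    return true;
                }
            }
            return false;  // no resource by that name
        }
    }

A route like "/document/([^/]+)/edit" with a hypothetical editHandler covers criteria 1 and 2, and hands the captured document id back to the handler - deconstructing the surrogate to the actual resource, which is criterion 4.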

If you're choosing a web framework and it can't provide a non-CGI URL for every resource in your domain, can't pattern match meaningful subsets of URLs to internal code, and won't let you either pass functions to resources or remap surrogates to your domain, then it's technically inadequate. It's failed its mandate.

I call these criteria out because I've heard arguments in the past that said CGIs and actions are necessary when you have a rich domain, or just a lot of objects. Here's how that goes - for each action you want to support, you need to expose a script to handle that action, taking a key as an argument. Perhaps the number of scripts is multiplied by the number of types in your application domain (e.g. edit-user, edit-blog, delete-user, delete-blog), or maybe you'll pass in a type parameter as well. Unfortunately exposing scripts like this for state manipulation has nothing to do with your rich domain - it's a sign your framework is deficient and has led you down the garden path.

The worst-case scenario for web programming is that you can't name your resources at all (some frameworks will not let you do this, Struts being the example I'm most aware of). You're hosed unless you write a micro-framework on top that can distribute names. It's heartbreaking when a web framework won't let you do the right thing and forces you to expose CGIs instead of simply passing state.

August 08, 2007

links for 2007-08-08

August 07, 2007

links for 2007-08-07

August 04, 2007

links for 2007-08-04

August 03, 2007

links for 2007-08-03

August 02, 2007

And the archer split the tree

Jeff Atwood:

"At the point when I spend all my time talking about programming, and very little of my time programming, my worst fear has been realized: I've become a pundit. The last thing the world needs is more pundits."

I speculate Jeff saves the world millions of dollars every year by propagating what good software development is. His weblog is a goldmine, literally.

links for 2007-08-02

Got a nail?

Russell Beattie:



It’s easy to think about examples of useful mobile-specific applications that are not “feed-centric”:

* Search - including local search, product search, movie times, etc.
* Maps and Location - unless you want to put your entire route into an RSS feed before you walk out the door, but even then, most people don’t use a map or directions unless they don’t know where they are. :-)
* Messaging - including email, community, forums, chat, etc. - sure you could see new messages if they came in via a feed update but don’t you want to respond?
* Money - Checking your balance? Trading shares? Buying a ringtone of a song you just heard? Shopping for an anniversary gift while you have some downtime? Hard to do with just a feed.

Ok, let's go through that list.

  • Search: Got me there for the form. Search results are lists though.
  • Maps. Got me there for the UI. I bet for prefetching the grids can come back as a list.
  • Messaging. All lists. Even threaded conversations (you ship a list and thread on the client).
  • Money. Invoices are itemised lists. Shopping carts are lists. Aisles are lists. BestSellers. Most Popular. Classics. Wishlists. Purchases in the last 6 months. Orders. Featured items. Wedding gift planners. New ringtones. All lists. A double ledger is two lists.
  • Trading shares. Never done it. Trading data are events, use XMPP/SMS. Unless you get batches, which would be lists of events.

I take Russell's point about interaction design over a phone. With the exception of maps and trading, his point can be summarized as "you forgot about forms". I did. I still say a list is a key data structure for shipping data over networks. The less mucking about we have to do with that, the more we can focus on the interactions (the useful stuff). And you know, most forms are lists as well ;) So I suspect the set of interactive things you might do with phones is larger than I knew, and the set of things that can be serialized to list form for phones is much larger than people realize.

Tab switching with SiteMesh

Using SiteMesh and meta tags for menus

In "Dependency Injection with SiteMesh", Matt Raible describes a technique using SiteMesh for "a tabbed menu that highlighted the current tab based on which page you were on." Matt's technique a SiteMesh feature that allows the target page to pass information in meta tags to decorator. This stops the template having to "know" what page (or class of pages) it's rendering. The target page has a meta tag with a attribute value of "menu"

    <head>
        <meta name="menu" content="Authors"/>
    </head>

and the decorator pulls in the menu value as follows:

    <c:set var="currentMenu" scope="request">
        <decorator:getProperty property="meta.menu"/>
    </c:set>
    <c:import url="/WEB-INF/pages/menu.jsp">
        <c:param name="template" 
            value="/template/menu/tabs.html"/>
    </c:import>

which allows the script menu.jsp (not shown here) to treat the menu value as a variable called 'template'. It's a neat trick.

Using SiteMesh and CSS selectors for menus

Here's another technique that uses SiteMesh's decorator:getProperty tag and leverages CSS selectors instead of scripts.

In the template, tell SiteMesh to fetch the body element's id attribute from the target page being rendered using decorator:getProperty thus:

    <body <decorator:getProperty 
        property="body.id" 
        writeEntireProperty="true" /> >

The 'writeEntireProperty' value tells SiteMesh to take the whole 'id="xxx"' and not just the id value.

Set up your global tab bar using a list, something like this:

    <div id="navbar">
        <ul>
        <li id="tab-home"><a href="/home">home<a/></li>
        <li id="tab-help"><a href="/help">help<a/></li>
        <li id="tab-about"><a href="/about">about<a/></li>
        </ul>
    </div>

and set up a corresponding CSS selector rule, something like this:

    body#home #tab-home a, 
    body#help #tab-help a,
    body#about #tab-about a
    {
        background-image: url( /images/naveON.gif );
    }

this tells the browser to swap in a background image when the combination of id attributes on the body and li elements match, i.e.

  <body id="about">
    ...
    <div id="navbar">
        <ul>
        <li id="tab-home"><a href="/home">home<a/></li>
        <li id="tab-help"><a href="/help">help<a/></li>
        <li id="tab-about"><a href="/about">about<a/></li>
        </ul>
    </div>

will highlight the 'about' tab. Now in each target page you can set the id attribute, for example:

  <body id="help">

SiteMesh will pull 'id="help"' from that page into the template's body element, which triggers the CSS selector.

What's nice about this is twofold. First, using CSS for tab highlighting is elegant, idiomatic and fast. Second, it doesn't require scripting in your JSP/Velocity/etc pages.

Don't do this

I'm a nut for having an empty email inbox, which is to say, a littered inbox really bugs me. But I haven't been able to keep up with my dehora.net mail for the last few (some) months. So, a while back I created an email folder called "respond". The idea is, mail you can't respond to you drag into the respond folder for handling later. I thought this was very clever, in a GTD "touch once" kind of way.

It was a really, really bad idea. I'm now more behind on mail than ever, and worse, I'm processing it backwards - last in is more likely to be first out. Essentially it seems to have gone like this - I'm not responding to mail I put there, because I don't have time allocated to processing that folder. I can't see what's in it without looking in it, which I forget to do, because my instinct is to process my inbox, because... well, I like having an empty inbox. And when I do remember, I don't want to look in it because I feel bad for not answering mail going ages back... so, the respond folder is gone. As of tonight everything has been moved back to the inbox. I bet I get through it by the weekend.

August 01, 2007

Mobile Lists

Bernie Goldbach:

"Good mobile data is inherently easy to use, quick to display and minimalist to ensure rapid download and quick comprehension. I would not want to think that the next generation of mobile internet screens--the ones set to appear in automobiles--will be as cluttered as the script-heavy, widget-filled, distracting experiences that represent most of the worldwide web today. There are strong reasons for a special flavour of mobile web information. The iPhone tries to repackage information attractively. A well-developed mobile web site presents clean-coded information that works well on an iPhone or a smaller mobile phone browser."

I'm new to the whole mobile web thing, but it seems to me that on constrained devices like phones everything ends up being a list. Or, put another way, all the sites I find easy to use and navigate (like BBC News online) are list based. So, why not just use syndication feeds for everything?