Bill de hÓra: July 2007 Archives


July 31, 2007

Point of origin

Tim Bray:

"For the non-geeks, this means that the feed URI will start with https:, it'll be a secure channel. This just has to happen, because otherwise there's a potential gold mine for a smart bad guy.

What the smart bad guy does is figure out how to (temporarily, locally) hack the DNS, say in a few key Manhattan offices, during trading hours. He sets up a fake sun.com and puts a fake news release in the feed claiming that we're the subject of a major SEC investigation, having first shorted a few million shares. Ouch!"

I doubt https is the way to go for financial statements.

First of all, the SSL/TLS technologies that support https links are, shall we say, web and scale unfriendly. What they do is encrypt the "channel" between two parties, so all data traveling over it is hidden from prying eyes. To do that, two computers must negotiate to secure the channel, which involves maintaining a secured point to point connection between them via their IP addresses. This is antithetical to how anyone is designing big web systems, which is all about not caring what computer served the data, because you can't, because the TCP/IP based web is not a broadcast medium. When you get down to brass tacks, the collective wisdom on scaling websites is about us not having to care that www.amazon.com is backed by a zillion anonymous servers. Generally the reason the Web doesn't fall over every day, as predicted it would by now, is a froth of caches and content networks. It's not https connections to origin servers.

Second, let's look at how this financial feed information is going to get picked up and syndicated around the Web. If my computer connects to Sun's via SSL to pick their financials, fair enough, our point to point connection is secured. But who's to say I'm not going to put that data into my "planet money" feed aggregator so that it can be picked up by downstream clients?

Atom RFC4287, which Tim and I worked on, is explicitly designed to allow entries to propagate through other feeds in this way, because relaying and recategorising entries is how people want web syndication to work. There are features in Atom that allow aggregators to identify entries and state the originating source, but they're not secure and are easily subject to spoofing.

One could say there will be restrictions attached to the redistribution of financial statements via Atom; but that would make accessing the feed so much less useful than current mechanisms, there'd be no point.

So, fuggedabout https for serving up quarterlies.

All of this is why you need to sign the data, which Tim mentions next:

"My hope is that if we and a few others start using signatures, the people who write clients will start checking them. This is the Internet, and we're playing with real money and shooting live ammunition; gotta be careful."

That would be great, except no-one is required to do so,

"Atom Processors MUST NOT reject an Atom Document containing such a signature because they are not capable of verifying it; they MUST continue processing and MAY inform the user of their failure to validate the signature." rfc4287

and generally, signing XML is complicated. But you can opt in, and libraries exist, though I suspect aggregator chains will result in altered signed data while leaving the signature as-is, a kind of syndication entropy that will take a few years to clean out. No matter, you don't want to use anything other than Atom RFC4287 if you want to syndicate and sign data.
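To make the channel-versus-data point concrete, here is a minimal sketch of why a signature that travels with the entry survives relaying where an https channel does not. Loud assumption: RFC4287 points at XML Digital Signature with public keys, and this sketch swaps in an HMAC with a shared key purely so it fits in the Python standard library. The key, the entry bytes and the function names are all made up for illustration.

```python
import hashlib
import hmac

# Hypothetical publisher key. A real feed would use a public-key XML signature;
# HMAC is only a stand-in to keep the sketch stdlib-only.
SECRET = b"publisher-demo-key"


def sign(entry: bytes) -> str:
    """Compute a signature over the exact bytes of the entry."""
    return hmac.new(SECRET, entry, hashlib.sha256).hexdigest()


def verify(entry: bytes, signature: str) -> bool:
    """Check an entry against a signature, however many hops it took to arrive."""
    return hmac.compare_digest(sign(entry), signature)


original = b"<entry><title>Q2 results</title></entry>"
signature = sign(original)

# An aggregator that relays the bytes untouched: the signature still verifies.
relayed_intact = original

# A spoofed entry carrying the old signature: verification fails.
relayed_altered = original.replace(b"Q2 results", b"SEC investigation")

# A well-meaning aggregator that only re-indents the XML: without
# canonicalization this fails too, which is the "syndication entropy" case.
relayed_reformatted = original.replace(b"><", b">\n<")

print(verify(relayed_intact, signature))
print(verify(relayed_altered, signature))
print(verify(relayed_reformatted, signature))
```

The third case is the one that worries me for aggregator chains: nobody is being malicious, the bytes just changed on the way through.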

links for 2007-07-31

Wag the dog

David Berlind:

Last October, shortly after salesforce.com started pushing its Apex platform, I pointed out what makes that platform truly unique. As far as I know, it’s the only business computing cloud that can host code developed by you. In other words, you can write software that taps the very business-oriented APIs of Apex and that code can run on salesforce.com’s systems (instead of your own). The key advantage of this approach is that scalability and reliability — the stuff associated with running hardware — are not your problem. They’re salesforce.com’s.

You could argue that this is no different than what Amazon is doing with EC2 or what the outcome would be if you turned to a Windows or Linux hosting outfit. But it is different. Whereas systems delivered from those environments have no intrinsic business value, the Apex platform is loaded with business oriented APIs for lifeblood functions like salesforce automation and customer relationship management.

Arguing that EC2 has no intrinsic business value is like arguing that an electrical grid or a telephone network has no intrinsic business value. Speculation: one reason business systems can't adapt is that the assumptions about what the business used to do are embedded deep in the code. Very deep, not easy to pull out. And not just in the code but in the physical architecture the system is running on. Business "logic" is like bindweed - by the time you've pulled it out, you've ripped out half the garden as well.

At a stretch, I'd say Joe Gregorio's recent theme of "N=1" is driven by not thinking about computing the way we think about electricity - where "n-tiered" is akin to "watermill". As strange as it might sound, businesses might get more value from computer systems once those systems stop being optimized around transient business requirements or features. You can customize things, but only above the infrastructure. That seems to be part of the pitch for something like EC2.

July 30, 2007

Shipping notes

Connector: If I understand this correctly, it means that Joyent's collaboration suite is now available as open source. Connector is a rare beast - groupware that isn't crap. Small to mid level companies who aren't ready to move things like calendaring and file management to an online service provider should check this one out.

SmartFrog 3.11.005beta. Deployment is the new Testing. Everyone and their granny is having to push web applications onto laterally scaling clusters, drop patches and security releases. It's more than fancy scp and resetting symlinks - you have to push out config files, make sure things come up in the right order, migrate the db, bounce httpd, run smoke tests, flush caches, roll back to a known good state if things go bad, and so on. SmartFrog is the only story Java has for this problem*, and given the complexity of the problem space, it's a good solution.

Ivy 2.0 Alpha 2. Ivy is a lovely, flexible solution to dependency management in Java. If you aren't prepared to bite off Maven or don't have a home-rolled solution, Ivy is worth a look; it fits like a glove with Ant. It's still in the Apache incubator, but Ivy counts as 'mature' these days.

JRuby 1.0. I'm probably the last person on the planet to mention that JRuby 1.0 shipped. Once you get past the Ruby hype, there's serious momentum in the form of engineering work being put into making JRuby performant and solid. It's been stable for some time, but having a "One Oh" is a big deal psychologically.

*Monkey. Ok, not shipping yet, but a Big Deal imo. Ruby, Python on Tamarin. Tamarin on IE. Tamarin with a JiT. I'm surprised the weblogs aren't on fire over these. Maybe I'm reading selectively, but this suite seems to be a bigger deal than stuff like Silverlight.

Atom Protocol. Is nearly done. Even though I'm a co-editor, and hence have some bias, this one will be fun to watch. AtomPub sits in a very strange place, as it has the potential to disrupt half a dozen or more industry sectors, such as Enterprise Content Management, Blogging, Digital/Desktop Publishing and Archiving, Mobile Web, EAI/WS-* messaging, Social Networks, and Online Productivity tools. As interesting as the adoption rates will be the people and sectors finding reasons not to use it, to protect distribution channels and data lock-ins with more complicated solutions. Any kind of data garden is fair game for AtomPub to rationalize.

* I also know about Puppet, Capistrano, and cfengine. SmartFrog/Puppet would be my choices; though I suspect you could start with Capistrano and move up to either.

Sorry about the comments

Apologies if you've tried to post a comment here and failed. It looks like comment submission is flaking out - John O'Shea, Dan Creswell and Dustin Whitney have all sent me a heads up saying they couldn't post. I've been inundated with spam of late, and it might be that Movable Type is getting bogged down (any comments related work in the admin takes forever or crashes and has done for a while now).

It really is time to make time to get this weblog migrated.

links for 2007-07-30

July 28, 2007

Struts 1 Problems

Update 2007/07/28: Paul Brown has some pointers and ideas on moving to Struts2, tho, I have a quibble with the title :)

I've been known to be critical of Struts 1 (aka "Struts classic") in the past, and in some cases have argued against its use in projects. Saying Struts sucks isn't especially enlightening, so for future reference I thought it would be handy to write down in one place (and dispassionately as possible) the issues I've seen. Where I can think of them, I've mentioned alternative approaches.

Verbosity and indirection. Even simple applications require a number of actions, controllers and form handlers. Large websites end up with very lengthy struts-config.xml files that in turn drive a demand for visual tools to understand them in maintenance. Some issues have been dealt with by extensions (eg DynaForms solving some issues with form beans), but the amount of surface area required to generate pages results in significant complexity and is reminiscent of the situation with EJBs. The important issue here is time to market; if you need to be flexible, deliver quickly and react to the customer promptly over repeated iterations, Struts will not easily support those needs. Any maintenance capacity associated with Struts is driven by Java itself and heavy IDE support, and is not inherent to the framework. Specifically, walking up to a Struts app and understanding where all the redirects are going and how the various orthogonal concerns are managed is normally a challenge, and keeping Struts apps clean and well organised over time requires effort to fight down entropy.

Limited pattern matching. Struts doesn't support wildcarded or templated URLs. The common workaround is URLs that end in *.do, *.html or *.jsp purely to satisfy the framework's mapping capability. It's hard to embed URL slugs in Struts without building a micro-framework on top to handle them. This can result in messy and inflexible URLs. Technically this can be dealt with, but the framework does not directly support good URL design.

struts-config.xml. struts-config tries to capture primarily the flow of application state on the server, by being an awkward representation of a call graph. In doing so it misses a key aspect of the web - hypertext. In web architecture, HTML hypertext on the client is the engine of application state, not an XML file on the server. Struts-config.xml tries to say too much - declaring controllers, forward dispatching, validators, etc. The end result is that it's difficult to understand how a Struts application works or is organized; it's often easier to click around the site and follow log traces or attach a debugger. Ultimately the configuration file focuses on the wrong thing: server-side redirection instead of URL mappings to the domain objects.

Alternatives: If you want to see an exemplary example of stating the URL space and view dispatching, look at Django's urls.py. For entirely automated mappings, Zope dispatch or Rails routes are good examples.
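For a feel of what "stating the URL space" means, here is a rough sketch in plain Python. It is written in the spirit of a urls.py but is not Django's actual API; the patterns, view functions and the dispatch helper are all hypothetical names of my own.

```python
import re


def archive_year(year):
    """Hypothetical view: list everything posted in a given year."""
    return f"archive for {year}"


def entry_detail(year, slug):
    """Hypothetical view: show one entry, addressed by year and slug."""
    return f"entry {slug} from {year}"


# The whole URL space in one readable table: templated paths, no *.do suffixes.
URLPATTERNS = [
    (re.compile(r"^/archive/(?P<year>\d{4})/$"), archive_year),
    (re.compile(r"^/archive/(?P<year>\d{4})/(?P<slug>[\w-]+)/$"), entry_detail),
]


def dispatch(path):
    """Find the first pattern that matches and call its view with the captured parts."""
    for pattern, view in URLPATTERNS:
        match = pattern.match(path)
        if match:
            return view(**match.groupdict())
    return "404 Not Found"


print(dispatch("/archive/2007/struts-1-problems/"))
print(dispatch("/archive/2007/"))
```

The slug and the year arrive in the view as named arguments, which is the bit that takes a micro-framework to get on top of Struts 1.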

Difficult to test or reuse code. Struts over-exposes the Servlets framework in method arguments, making the code difficult to test outside a container. This design idiom also encourages developers to pass down and depend on things like ServletContext deep into the object domain. Struts uses an inheritance driven API (essentially Java 1990s style), which makes it awkward to reuse code or keep the domain independent of the web framework.

Alternatives: Spring MVC and Struts2/WebWork keep Servlet specific artifacts to a minimum.

Limited template support. The templating system in Struts is Tiles. Although it does allow you to define things like footers in one place, which is good, it's difficult for developers to manage the look and feel of a website independently of its engineering. For designers it's hard to actually see the layouts - the primary design artefact is the Tiles configuration itself (which is technically unnecessary). This makes Struts an incomplete framework that tends to couple the view and template (granted, JSP is a cause here as well).

Alternatives: In Java, teams often migrate to SiteMesh to support cohesive layouts and layout switching, which it does by inverting the rendering flow - pulling content from the JSP output into the template. Django templates' page blocks along with template inheritance achieve DRY for both includes and layouts. Plone layouts are modular skins and are divorced from content objects. Movable Type and Wordpress have template/skin support by defining function based DSLs a template author can use.

Conceptual mismatch with the Web. Struts is at odds with Web architecture because it focuses on actions instead of nouns. Specifically, forms and actions (common gateways) are emphasised instead of requests to resources. This is because Struts is action/controller driven (sometimes called Model2 in the Java world). Controllers come from the "MVC" pattern, which was originally derived for organizing desktop applications. Unfortunately desktop applications have very little in common with web applications, either in terms of interaction design or engineering. In pattern language terms, the MVC pattern has been reapplied to the Web domain without consideration for the forces induced by the Web. MVC frameworks tend to be inwardly focused contraptions of server-side redirects.

Alternatives: On the web, a suitable pattern is View, Model, Template. A request to a URL is dispatched to a View. This View calls into the Model, performs manipulations and prepares data for output. The data is passed to a Template that is rendered and emitted as a response. Ideally in web frameworks, the controller is hidden from view. Note that this framework style is often called MVC anyway, confusing matters somewhat; the key differences are that Views and Templates are cohesive and Controllers are pushed down into the framework infrastructure.
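A toy version of View, Model, Template, assuming nothing beyond the Python standard library. The entry data, the template text and the function names are invented for the example; a real framework would supply the dispatching that calls the view.

```python
from string import Template

# Model: plain domain data and lookups, with no web objects in sight.
ENTRIES = {"struts-1-problems": {"title": "Struts 1 Problems", "year": 2007}}


def get_entry(slug):
    """Return the entry for a slug, or None if there isn't one."""
    return ENTRIES.get(slug)


# Template: only knows how to lay out the values it is handed.
ENTRY_TEMPLATE = Template("<h1>$title</h1><p>Posted in $year</p>")


# View: calls into the model, prepares the data, hands it to the template.
def entry_view(slug):
    """Return (status, body) for a request that was dispatched to this view."""
    entry = get_entry(slug)
    if entry is None:
        return 404, "Not Found"
    context = {"title": entry["title"], "year": entry["year"]}
    return 200, ENTRY_TEMPLATE.substitute(context)


print(entry_view("struts-1-problems"))
```

Note there is no controller to write: whatever maps the URL to entry_view lives in the framework, and the model can be tested without a container.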

Other supporting design patterns are Decorator for rendering and Proxy/Interceptor for introducing orthogonal concerns (such as permissions checks and entity caching). SiteMesh is a good example of a Decorator and the Servlets spec supports Interception via servlet filters.

Almost any issue mentioned here has a workaround, a compensating pattern, or an extension you can use. The obvious exception is "conceptual mismatch with the Web" - that can't be solved without changing the framework altogether. I'm sure there are many many people who are happy and successful with Struts; this isn't meant to disparage people working with it - once upon a time, Struts was the only real game in town for Java web development. Nonetheless Struts 1 is limited and is showing its age. Starting new projects with it needs careful consideration, despite its continued popularity.

links for 2007-07-28

July 27, 2007

links for 2007-07-27

July 26, 2007

links for 2007-07-26

July 22, 2007

Eventually consistent

Joe Gregorio:

"The problem with current data storage systems, with rare exception, is that they are all "one box native" applications, i.e. from a world where N = 1. From Berkeley DB to MySQL, they were all designed initially to sit on one box. Even after several years of dealing with MegaData you still see painful stories like what the YouTube guys went through as they scaled up. All of this stems from an N = 1 mentality."

One company I can think of off the top of my head that didn't start with N=1 for storage is Google. I think that has to do with the fact that search is one of the few problem spaces for which a relational database isn't automatically deemed the right option. Problems for which databases are reflexively "the answer" seem to be most problems; later on that requires evolution-in-production to deal with the limitations of centralized storage. This is especially so for social networking, where the graph of relations is the key data - what could be more natural than an RDBMS for that?

Hence we now have plenty of introductory material on how to scale physical n-tier architectures backed by relational databases to handle planetary class traffic. An interesting takeaway is that it's clearly possible to re-architect data storage on super-busy production systems seemingly no matter where you start from.

Well defined programming models, transactions, easy master/slave replication, predictable scaling/cost characteristics, big expertise pool, framework bias, solid available solutions, SQL, DML; these count heavily in favor of relational databases as the axiomatic choice for the data layer. But I think the juice is in the promise of data consistency. Getting people to compromise the idea of consistent data and ACID semantics for something like high availability (HA) is a huge challenge. I suspect plenty of people don't realize that HA and ACID are in conflict for larger values of N and where the data is geographically distributed.

Another thing with getting people to relax database designs is that an available, scalable RDBMS system where N=1 will tend (I think) to have a physical model that looks like something an application developer threw together over a weekend, not the well designed model of a professional DBA. Professional DBAs don't start with redundant data models. They don't have columns for derived data. They don't manage joins in the application layer. They don't say normalization is for sissies. So the reaction from a data specialist to an app developer saying table joins are going to be too expensive to run is going to be skepticism at best, and I think this is understandable. "Crappy" databases become a nightmare to deal with the second the data has to be re-used. That said, it would be no harm if the DBAs stopped to take a look at the request volumes (especially reads) that app developers are expected to support, and just how much caching infrastructure is being deployed purely to stop database servers from being vaporised under load.

SOA, incidentally, is also committed to trading off data consistency for some other desirable characteristic, probably partitioning (in the guise of separation of business concerns) rather than HA. In SOA the consistency workarounds are called "orchestration", which sounds a lot more palatable than "application level joins".
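For anyone who hasn't had to write one, this is roughly what an application level join looks like. It's a sketch over made-up in-memory data standing in for two stores that could sit on different boxes; the store names and the friends_of helper are hypothetical.

```python
# Two stores that could live on different boxes (hypothetical sample data).
USERS = {1: {"name": "ada"}, 2: {"name": "linus"}}

# (user_id, friend_id) pairs. User 3 was never written to the users store,
# and nothing here stops that from happening: no foreign keys across boxes.
FRIENDSHIPS = [(1, 2), (2, 1), (1, 3)]


def friends_of(user_id):
    """Join friendships to users in application code instead of in SQL."""
    friend_ids = [friend for user, friend in FRIENDSHIPS if user == user_id]
    # The "join": one lookup per friend id against the other store,
    # quietly skipping rows the other store doesn't know about.
    return [USERS[f]["name"] for f in friend_ids if f in USERS]


print(friends_of(1))
print(friends_of(2))
```

The dangling reference to user 3 is the consistency you gave up; the code just has to cope with it, which is exactly what makes the DBA wince.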

July 21, 2007

links for 2007-07-21

July 20, 2007

System Testing

Bernie Goldbach on testing mobile phones:

"One had the most durable finish I've seen for pub operations - you could slide it across the floor, through the Guinness and under the feet of a nearby table and it still held its connection."

links for 2007-07-20

Design for the web

Exhibit 1: Wads and Wads about using ETags to reduce bandwidth and workload with Spring and Hibernate. Too much to distill into a quote. But Gavin Terrill's article is a great read; he does things like making sure not to use any machine/physical context to calculate the ETag, so it'll be consistent across a cluster of servers. Frankly, awareness of this sort of thing is lacking in the Java space. As Floyd Marinescu observed in the comments: "It would be cool to see a generalized etag caching framework added to some of today's modern Java webframeworks." Yes it would.

Exhibit 2: Django's support for ETags, which I can quote: "django.middleware.common.CommonMiddleware: Handles ETags based on the USE_ETAGS setting. If USE_ETAGS is set to True, Django will calculate an ETag for each request by MD5-hashing the page content, and it'll take care of sending Not Modified responses, if appropriate." That's it - you're done.

Exhibit 3: Rails support for Etag, again quotable in full: "Rendering will automatically insert the etag header on 200 OK responses. The etag is calculated using MD5 of the response body. If a request comes in that has a matching etag, the response will be changed to a 304 Not Modified and the response body will be set to an empty string."
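What both frameworks describe boils down to very little code. Here is a minimal sketch of the idea in plain Python, not Django's or Rails' actual implementation; the function names and the (status, headers, body) return shape are my own assumptions.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Derive the ETag from the response body alone, so every box in a cluster agrees."""
    return '"' + hashlib.md5(body).hexdigest() + '"'


def conditional_response(
    body: bytes, if_none_match: Optional[str] = None
) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body), answering 304 when the client's ETag matches."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if if_none_match is not None:
        # If-None-Match may carry several comma-separated tags.
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        if etag in candidates:
            return 304, headers, b""
    return 200, headers, body


# First request: full response plus an ETag the client can remember.
status, headers, body = conditional_response(b"<html>hello</html>")

# Repeat request with If-None-Match set: no body goes over the wire.
status_again, _, body_again = conditional_response(b"<html>hello</html>", headers["ETag"])
print(status, status_again)
```

Note this only saves bandwidth; the page is still rendered in order to hash it. Saving the server work as well means deriving the tag from something cheaper than the rendered body.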

The relative verbosity of programming languages isn't the interesting thing; nor is typing doctrine. What's interesting is the culture of frameworks and what different communities deem valuable. My sense of it is that on Java, too many web frameworks - think JSF, or Struts 1.x - consider the Web something you work around using software patterns. The goal is to get off the web, and back into middleware. Whereas a framework like Django or Rails is purpose-built for the Web; integrating with the internal enterprise is a non-goal.

ETag support is just one example; there are so many things frameworks like Rails/Django do ranging from architectural patterns around state management, to URL design, to testing, to template dispatching, to result pagination, right down to table coloring that the cumulative effect on productivity is startling. I suspect designing for the Web instead of around it is at least as important as language choice.

It's hard to explain sometimes just how time-consuming it can be to get Web things done on some Java frameworks. This post will be a handy thing to point at next time I'm lost for words :)

July 19, 2007

Google numbers

Looks like Google's Q207 numbers are off expectations. Oh well; the street can be myopic.

Most interesting figures - 5% slip in operating margin against growth and a 74% increase in employees during the past year. The former might suggest creeping bureaucracy. The latter is less about the bottom line, but just as important - say what you like about Google's "over the mean" hiring policy, that figure represents massive residual risk.

Language Game

Christopher Lenz's point about needing presentation logic is on the money, so why not go a little further: "Why Not Just Allow Python in Templates?"

Fair enough - here are some things to think about:

Time flies

"In my part of the world, Scala doesn't make Google results front page. Give it a month"


links for 2007-07-19

July 18, 2007

links for 2007-07-18

July 17, 2007

links for 2007-07-17

July 16, 2007

links for 2007-07-16

July 15, 2007

links for 2007-07-15

July 13, 2007

links for 2007-07-13

July 12, 2007

links for 2007-07-12

July 11, 2007

links for 2007-07-11

July 10, 2007

links for 2007-07-10

July 08, 2007


Adam Bien:

"My yesterday entry "After one year Eclipse's abstinence, I had to close netbeans and use eclipse :-)" caused too much traffic and overtaxed my jboss server. Because of the high memory consumption the oom-killer terminated not only jboss, but also glassfish v2, james and my databases."

Weird. Adam's pages always seem to get served from origin; repeated requests for resources like CSS files will come back with 304 not modified (good), but not the main entry pages.

"Now I'm really motivated to speed-up the migration of the remaining applications from jboss to glassfish :-). The main reason for the migration from jboss to glassfish was the command-line interface as well as the great JSF based admin GUI and built-in monitoring (with graphical output...) capabilities. I'm curious whether Glassfish will be also killed by oom, or could handle the same load"

Investing some time in cache configuration might pay off better than a new app server ;)

July 07, 2007

links for 2007-07-07

July 06, 2007

links for 2007-07-06

July 05, 2007

links for 2007-07-05

July 02, 2007

links for 2007-07-02

July 01, 2007

links for 2007-07-01