The point made by
(currently 404ing), and supported by
Malik on the need for Web 2.0 startups to consider scalability
of their systems is valid, as far as it goes. But I believe the
argument is, in part at least, misdirected and doesn't go far
enough. Web 1.0 is the size it is more because the Web has a
scalable architecture than because of the performance of any local systems.
The big issue isn't that individual companies don't build
scalability into their own architectures, but rather that they don't
tend to adequately exploit the scalability of the Web itself. Definitions
of Web 2.0 vary, but I think two key aspects are
The Web as Platform and
The Architecture of Participation. For applications that
exploit these paradigms to function, there must certainly be some
forward-looking design locally, for example in setting up a
distributed database on the company's servers. But more
importantly, I believe, there needs to be more work done at the edges,
moving Web-common features out to them and away from the application-specific core.
If the Web is the platform, then as much as possible of the data that isn't entirely application-specific should be exposed in such a way that it can be devolved to other parts of the Web. Use of existing Web-based data (i.e. other people's exposed stores) shouldn't be the exception, it should be the rule. This means that for a system to be robust there will be more work on caching, and less on the construction of disconnected data silos. Caching is one of the key features that has enabled Web 1.0 to scale, and the current caching architecture is agnostic: it works now for human-oriented documents, and there's no reason it shouldn't work for machine-oriented data. Sure, in general there is still work to do on data interchange and caching, but there's already a lot in place and a lot more on its way (I personally believe the Semantic Web technologies offer a good route to commodity stores and distributed queries).
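To make the caching point concrete, here's a minimal sketch of the validation-caching pattern (ETag plus If-None-Match, as in the HTTP spec) applied to machine-oriented data. The URL, the cache class and the response values are all hypothetical, purely for illustration; the point is that the mechanism is exactly the same one Web 1.0 uses for documents.

```python
# Minimal sketch: an ETag-validated cache for machine-oriented data.
# The URL and payloads are hypothetical; the validation logic mirrors
# standard HTTP conditional GET semantics.

class DataCache:
    """Holds fetched data keyed by URL, revalidated with ETags."""

    def __init__(self):
        self._store = {}  # url -> (etag, body)

    def conditional_headers(self, url):
        """Headers for a revalidation request: If-None-Match when we hold an ETag."""
        entry = self._store.get(url)
        return {"If-None-Match": entry[0]} if entry else {}

    def update(self, url, status, etag, body):
        """Apply a response: 304 means our copy is still valid; 200 replaces it."""
        if status == 304:
            return self._store[url][1]  # serve the cached body
        self._store[url] = (etag, body)
        return body


cache = DataCache()
# First fetch: the origin returns 200 with an ETag and a body.
body = cache.update("http://example.org/data.rdf", 200, '"v1"', "<rdf/>")
# Later: revalidate by sending If-None-Match; a 304 means reuse the cache.
hdrs = cache.conditional_headers("http://example.org/data.rdf")
```

The same logic applies whether the body is an HTML page, an RSS feed or an RDF/XML document, which is why the existing caching infrastructure can carry machine-oriented data unchanged.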
The mash-up is the prototypical Web 2.0 service, but many current systems are much more enclosed, following the Web 1.0 walled-garden pattern: jealously guarding parts of their systems which they perceive to be valuable, but which in the Web-wide marketplace have no distinguishing features. Things like identity management are best handled cooperatively; in general, sharable data should be shared. For there to be real innovation in the Web space, startups and their developers should be free to concentrate on their Unique Selling Propositions, not on reimplementing what is already widespread. The advantage of this approach is that they can be even more lightweight: there is significantly less work needed on infrastructure than in the Web 1.0 Pet Store mindset.
For an example of what I'm talking about, consider the FeedMesh, a little consortium of blog-search-oriented companies which share ping data arriving from frequently-updated sites. Part of that system is closely tied to a specific protocol and model: the data input, the ping endpoints. But the transfer of data between members of the consortium is not tied to the specific data application in question; it's just a stream of XML. What members of the consortium do with that data is entirely up to them. In this case I believe that on change notification they each go and collect the data from the remote sites, populating their local stores. But it isn't difficult to imagine the feed data from those remote sites being the currency streamed between FeedMesh members; in fact there has been discussion of doing it that way. Right now the local databases of these companies are probably implemented in widely divergent ways, tailored to the particular services the companies offer. However, given that they are all receiving, forwarding, processing and storing the same kind of data, there's no reason they couldn't each use the same commodity software for the non-service-specific parts of their operations. Ok, right now most of the FeedMesh companies are probably building on Apache, PHP and MySQL, but there's significant commonality in their systems a layer above: a shared data model, in this case that of syndicated feeds, with the data exchanged based on standards (e.g. Atom or RSS). At this point in time the FeedMesh itself is a relatively closed system, and in that sense not Web-friendly. But in the same way that Intranets inform decisions for the Internet, so too can systems like these which have desirable features (another example might be Google Base).
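The "just a stream of XML" part is the interesting bit: the consortium-internal transfer is generic enough that any member could process it with commodity code. Here's a small sketch of what consuming such a stream might look like. The element names and the notification format are hypothetical, invented for illustration; they are not the actual FeedMesh wire format.

```python
import xml.etree.ElementTree as ET

# Hypothetical ping-stream fragment: element names ("updates", "update")
# and attributes are illustrative only, not the real FeedMesh format.
stream = """<updates>
  <update url="http://example.org/blog/feed.xml" when="2005-11-20T10:00:00Z"/>
  <update url="http://example.net/weblog/atom.xml" when="2005-11-20T10:00:05Z"/>
</updates>"""


def changed_feeds(xml_text):
    """Extract the feed URLs announced in a ping stream."""
    root = ET.fromstring(xml_text)
    return [u.get("url") for u in root.findall("update")]


# Each consortium member would take these URLs and either fetch the
# feeds to populate its local store, or pass the data straight on.
urls = changed_feeds(stream)
```

Nothing in this code cares what service sits on top; that's the sense in which the receive/forward/store layer could be shared commodity software.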
So here's how I imagine a possible future application architecture scenario. A startup's software will consist of three parts commodity to one part unique. The commodity parts will be the bulk of their data storage, with standard interfaces. There are at least two architectures for this in the pipeline: for document/content/feed-related applications, Atom Stores supporting the Atom Publishing Protocol should be eminently suitable; for more generic data, RDF stores with RDF/XML interchange and the SPARQL protocol and query language will be available (in fact an Atom Store could be built this way). The unique part of the startup's system will be specific to their application: perhaps receiving data from other sources, providing the user interface, and processing all this alongside the commodity data. They may well be generating new data, which would be fed back into the commodity store. For business reasons they'll no doubt want to implement some access controls to ringfence parts of the system. But for the most part the commodity store and interfaces would be open to the rest of the Web, with data flowing to and from them freely, in effect acting as a cache of Web data. The Architecture of Participation. Because this block of functionality would be something lots of people could use, the many-eyeballs mechanism of open source should ensure that robust scalability is built in. Managing expansion of the service's user base will mean lower demands on system development (only 1/4 will need specialist development work); scaling up will primarily mean adding more commodity hardware. The fact that 3/4 of the system is commodity, and so could be available off-the-shelf, will allow the company to innovate with the remaining 1/4.
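As a rough illustration of what "standard interfaces" buys the unique quarter of the system, here's a sketch of feeding new data back into a commodity Atom Store: build a minimal Atom entry and prepare an HTTP POST to a collection, which is the basic publishing interaction in the Atom Publishing Protocol. The store URL is hypothetical, so the request is constructed but not actually sent.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def make_entry(title, content):
    """Build a minimal Atom entry document (namespace per the Atom format)."""
    ET.register_namespace("", ATOM_NS)
    entry = ET.Element("{%s}entry" % ATOM_NS)
    ET.SubElement(entry, "{%s}title" % ATOM_NS).text = title
    ET.SubElement(entry, "{%s}content" % ATOM_NS).text = content
    return ET.tostring(entry, encoding="unicode")


body = make_entry("Hello", "First post")

# APP-style publishing is a POST of the entry to a collection URI.
# The endpoint below is hypothetical, so we build the request without
# sending it; against a real Atom store you'd call urlopen(req).
req = urllib.request.Request(
    "http://store.example.org/collection",
    data=body.encode("utf-8"),
    headers={"Content-Type": "application/atom+xml"},
    method="POST",
)
```

The startup's unique code only has to speak this generic interface; which store implementation sits behind the collection URI is somebody else's (commodity) problem.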
In other words, the Web is the Platform, let's use it.
PS. Richard MacManus says he found this post interesting (thanks!) and put in a nutshell a point I was trying to make:
…because commodity data is such an integral part of many Web 2.0 services, then caching in effect acts as a storage mechanism for data.