A useful post from Adam Bosworth: Where have all the good databases gone. He's talking in the context of Amazon-size enterprise DBs, which generally means RDBMSs. He lists the three things users of (large) databases tend to ask for:
- Dynamic schema
- Dynamic partitioning of data across large dynamic numbers of machines
- Modern indexing
"Users of databases don't believe that they are getting any of these three."
Standing back from what's already out there, what of the alternatives? Personally I'm pretty convinced of the benefits of RDF stores, but would be hard-pressed to say why apart from a woolly intuition of them being like RDBMSs but more flexible and Web-friendly. So it's good to have a checklist. I've not yet seen anything to suggest XML DBs offer any real benefit over RDBMSs for large-scale storage, though there is obviously a great deal of utility around XML at document level.
As an aside, it seems to me a cherry-picking Frankenstein system today would probably have an RDBMS backend with part of it implementing a triplestore (RDF/OWL+rules inferencing being done partially in a datalogish fashion through the relational logic with the aid of triggers and the like, and partially through a programmatic layer on top). There would be less dependency on business objects (i.e. minus the EJBs), much more done declaratively, with more of the access occurring directly through query languages: SQL and SPARQL. I'd suggest XML's role would primarily be in the endpoint interfaces - a buffer zone of translation through XSLT/XQuery and then interchange over the wire using a mix of domain-specific XML and RDF/XML. Transport would be primarily RESTful HTTP, though some WS-* stuff would no doubt be needed for interchange with existing systems. Most of the user interface would sit outside that, delivered through XHTML, SVG, XForms and so on with the aid of ECMAScript, but there would still be a significant role for fat clients, which would probably tend to be purpose-specific (e.g. sysadmin). But that's just me musing aloud.
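To make the triplestore-on-an-RDBMS idea a little more concrete, here's a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption for illustration: the single `triples` table, the function names, and the `ex:` resources are all made up, and the subclass inference uses a recursive SQL query as a stand-in for the triggers mentioned above. It isn't how any particular store does it.

```python
import sqlite3


def build_store():
    """Create an in-memory relational backend holding one triples table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples (s TEXT NOT NULL, p TEXT NOT NULL, o TEXT NOT NULL)"
    )
    # Two indexes so lookups by subject or by predicate/object avoid full scans.
    conn.execute("CREATE INDEX idx_spo ON triples (s, p, o)")
    conn.execute("CREATE INDEX idx_pos ON triples (p, o, s)")
    return conn


def add(conn, s, p, o):
    """Assert one statement."""
    conn.execute("INSERT INTO triples (s, p, o) VALUES (?, ?, ?)", (s, p, o))


def names_known_by(conn, person):
    """Graph pattern as a self-join. Roughly the SPARQL pattern
    { <person> foaf:knows ?x . ?x foaf:name ?name }."""
    rows = conn.execute(
        """
        SELECT n.o
        FROM triples AS k
        JOIN triples AS n ON n.s = k.o
        WHERE k.s = ? AND k.p = 'foaf:knows' AND n.p = 'foaf:name'
        ORDER BY n.o
        """,
        (person,),
    )
    return [row[0] for row in rows]


def types_of(conn, resource):
    """Datalog-style rule done in SQL: a resource has every type reachable
    from its asserted rdf:type by following rdfs:subClassOf."""
    rows = conn.execute(
        """
        WITH RECURSIVE cls(c) AS (
            SELECT o FROM triples WHERE s = ? AND p = 'rdf:type'
            UNION
            SELECT t.o FROM triples AS t JOIN cls ON t.s = cls.c
            WHERE t.p = 'rdfs:subClassOf'
        )
        SELECT c FROM cls ORDER BY c
        """,
        (resource,),
    )
    return [row[0] for row in rows]
```

The point of the sketch is only that both the pattern match and a simple inference rule can stay declarative inside the relational layer; anything more elaborate would presumably move up into the programmatic layer.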
Back to Adam's points. The first is well covered by RDF: schema flexibility is an inherent feature. That it can be applied in practice is borne out by convincing anecdotal evidence.
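A tiny sketch of what that flexibility looks like, assuming nothing more than statements held as plain (subject, predicate, object) tuples. The `describe` helper and the `ex:` names are hypothetical, invented just for this example.

```python
def describe(triples, subject):
    """Collect every property/value pair currently asserted about a subject."""
    description = {}
    for s, p, o in triples:
        if s == subject:
            description.setdefault(p, []).append(o)
    return description


triples = [
    ("ex:post1", "dc:title", "Where have all the good databases gone"),
    ("ex:post1", "dc:creator", "Adam Bosworth"),
]
before = describe(triples, "ex:post1")

# A new kind of statement turns up: no ALTER TABLE, no migration,
# just one more triple using a predicate nobody declared in advance.
triples.append(("ex:post1", "ex:topic", "databases"))
after = describe(triples, "ex:post1")
```

The existing data and the code reading it carry on unchanged; the new property simply shows up in `after` alongside the old ones.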
By "modern indexing" Adam means Google-like, i.e. allowing fast full-text document keyword search, with the content of blobs like PDFs included in the indexes. I don't think inside-literal search is as well integrated with the RDF(/OWL) model as it maybe could be, but the fact that Kowari models can have Lucene built in demonstrates it is possible. Transparent XQuery/XSLT on XMLLiteral content would be good too - there are some benefits to the XML DB approach that could be ported across.
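For what inside-literal search might amount to at its simplest, here's a rough sketch of an inverted index over literal values. It's an assumption-laden toy, not how Kowari or Lucene do it: every object is treated as a text literal, tokenising is a crude lowercase split, and the function names are made up.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(triples):
    """Map each word to the (subject, predicate) pairs whose literal contains it."""
    index = defaultdict(set)
    for s, p, o in triples:
        for token in tokenize(o):
            index[token].add((s, p))
    return index


def search(index, query):
    """Return the (subject, predicate) pairs whose literal contains every query word."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return hits
```

Because the hits come back as subject/predicate pairs, they can be fed straight back into an ordinary graph query, which is the kind of integration that seems to be missing.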
I don't personally know much about dynamic data partitioning or what's been implemented to date. But given how the Web can be seen as a single mega-DB in the RDF model (thanks largely to the use of URIs), and that there are implemented algorithms for P2P sharing of metadata, it seems likely that in principle this is straightforward, though optimisation would presumably need attention if RDBMS-like performance were required.
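As a back-of-envelope illustration of why URIs help here, this sketch spreads triples across machines by hashing the subject URI, so everything said about one resource lands on the same node. It's purely my assumption of a simplest-possible scheme (the function names are invented), not a description of any implemented system.

```python
import hashlib


def node_for(subject, node_count):
    """Pick a node from a stable hash of the subject URI.
    hashlib is used because Python's built-in hash() varies between runs."""
    digest = hashlib.sha1(subject.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def partition(triples, node_count):
    """Distribute triples so that all statements about a subject share a node."""
    nodes = [[] for _ in range(node_count)]
    for triple in triples:
        nodes[node_for(triple[0], node_count)].append(triple)
    return nodes
```

Plain modulo hashing like this reshuffles most of the data whenever the number of machines changes, so the "dynamic numbers of machines" part of Adam's list would need something smarter, consistent hashing being the obvious candidate.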
Focussing on RDF's features, it has a solid logical base which offers pretty much the same capabilities as Codd's relational model but with additional support for ontology-based reasoning. The interchange language is usable and tool support is pretty good. The SPARQL language may not quite be finished but it's certainly comparable to SQL. These are all big advantages on top of Adam's checklist.
Overall this would suggest that RDF stores are potentially good DBs, on many points potentially much better than regular RDBMSs (or XML DBs) because of the more flexible model. But for this to be practicable, performance has to be brought to a level comparable with RDBMSs, which, if it hasn't already been done, would I think only be a small matter of programming.
None of this means a great deal if there isn't adequate adoption to support the maintenance of tools, but I'm pretty sure there is enough momentum to drive development. The scope for innovation on the (Semantic) Web is quite a carrot. Looking at Adam's list, the agility that triplestores can offer is probably RDF's best feature in this context. If someone can come up with RDFUnit then the eXtreme crowd will start appearing over the dunes like camels to an oasis.
PS. Some material relating to how LiveJournal scaled.
PPS. Must-read follow up from Bill de hÓra seeking clarity on the coalface.
[Danny]