Sampling

Stefano Mazzocchi has a good post On Data Integration with Semantic Web Technologies. He discusses the very real problem of data integration with RDF when the models don't quite match. This is certainly an issue for the Semantic Web as a unified data space, but I can't help thinking Stefano's seeing the glass as half-empty when it's actually more than half-full.

For a start, this kind of problem appears any time you want to integrate different data models, irrespective of the technologies used. That the web is an environment in which diverse models from anywhere can meet is a case of difficult things being made possible, not of easy things being made hard.

A fundamental part of the puzzle is already in place - globally uniform identifiers for entities and relationships, in the form of URIs. Even with these global invariants in place, though, as Stefano suggests, there is no single model, no global ontology into which individual statements can be consistently placed. But heterogeneity at this level is a feature, not a bug - it's how people and their tools view the world. The effect is that to work productively with the data in the global database, it'll often be necessary to operate on projected views rather than directly on the model(s).

There's a simple example in syndication - Atom supports versioning of individual entries where RSS in its deployed form doesn't. Same data, different models. But the way such stuff is used depends on the application. If versioning is important, then entry identifiers within the application will be a combination of the entry id plus the fields that can vary - notably the updated date. Both Atom and RSS can be mapped to such a view, though extra work will be needed to synthesize the versioning for RSS. If versioning isn't important, the data can be flattened into a single-version view, presumably holding the most recent version of each entry.
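The two views can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Entry` class, its field names and the sample ids are made up for the example, and the timestamps are assumed to be UTC ISO 8601 strings so that plain string comparison orders them correctly. For RSS, the `updated` value would have to be synthesized by the application in whatever way suits it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical application-level entry, mapped from Atom or RSS."""
    entry_id: str  # e.g. an Atom id or an RSS guid
    updated: str   # UTC ISO 8601 timestamp (assumed), so strings sort by time
    title: str


def versioned_view(entries):
    """Key each entry by (id, updated), so every revision is kept."""
    return {(e.entry_id, e.updated): e for e in entries}


def flattened_view(entries):
    """Keep only the most recent revision seen for each entry id."""
    latest = {}
    for e in entries:
        current = latest.get(e.entry_id)
        if current is None or e.updated > current.updated:
            latest[e.entry_id] = e
    return latest


# Made-up sample data: two revisions of one entry, one revision of another.
entries = [
    Entry("tag:example.org,2007:post-1", "2007-03-29T09:00:00Z", "Sampling (draft)"),
    Entry("tag:example.org,2007:post-1", "2007-03-30T08:30:45Z", "Sampling"),
    Entry("tag:example.org,2007:post-2", "2007-03-30T07:00:00Z", "Another post"),
]

all_versions = versioned_view(entries)
latest_only = flattened_view(entries)
```

Same input, two projections: which one is right depends on whether the application cares about revisions.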

Stefano has a brilliant analogy with digital audio, calling this kind of remapping process resampling (OK, having built a few samplers, maybe I'm biased). However, he seems to have overlooked a tool which is already available for RDF-to-RDF transformation - SPARQL CONSTRUCT. (Does Simile support it yet?)
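To give a feel for what a CONSTRUCT-style remapping does, here is a toy imitation in plain Python over triples held as tuples. It is not a SPARQL engine, just a hand-written version of one query, which is shown in the leading comment. The `ex:` vocabulary (`writtenBy`, `fullName`) and the sample URIs are hypothetical; `dc:creator` is the Dublin Core term. The distinction between URIs and literals is glossed over.

```python
# Roughly equivalent SPARQL (ex: is a hypothetical source vocabulary):
#
#   PREFIX ex: <http://example.org/vocab#>
#   PREFIX dc: <http://purl.org/dc/elements/1.1/>
#   CONSTRUCT { ?doc dc:creator ?name }
#   WHERE     { ?doc ex:writtenBy ?person .
#               ?person ex:fullName ?name }

EX = "http://example.org/vocab#"
DC = "http://purl.org/dc/elements/1.1/"


def construct_creator(triples):
    """Join the two WHERE patterns and emit one dc:creator triple per match."""
    # Index ?person -> [?name] from the ex:fullName triples.
    names = {}
    for s, p, o in triples:
        if p == EX + "fullName":
            names.setdefault(s, []).append(o)

    # For each ?doc ex:writtenBy ?person, build the CONSTRUCT template triple.
    result = set()
    for s, p, o in triples:
        if p == EX + "writtenBy":
            for name in names.get(o, []):
                result.add((s, DC + "creator", name))
    return result


# Made-up source data in the source model.
source = {
    ("http://example.org/post/1", EX + "writtenBy", "http://example.org/people/danny"),
    ("http://example.org/people/danny", EX + "fullName", "Danny Ayers"),
}

resampled = construct_creator(source)
```

The output is itself a set of triples, so the result of one remapping can be fed straight into another, or merged with other data.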

There is a potential problem with republication of transformed data, in that right away there may be inconsistency with the original source data. Here provenance tracking (probably via named graphs) becomes a must-have. The web data space itself can support very granular separation. Either way, data integration is a hard problem. But if you have a uniform language for describing resources, at least it becomes possible.
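As a rough illustration of the named-graph idea, the sketch below tags each triple with the name of the graph it came from, turning triples into quads, so that conflicting statements can be traced back to their sources. The graph names, URIs and helper functions are all made up for the example; a real store would do this for you.

```python
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
POST = "http://example.org/post/1"  # hypothetical resource


def with_provenance(triples, graph_name):
    """Turn triples into quads by tagging each with its source graph."""
    return {(s, p, o, graph_name) for s, p, o in triples}


def graphs_asserting(quads, subject, predicate):
    """Map each value asserted for (subject, predicate) to the graphs asserting it."""
    found = {}
    for s, p, o, graph in quads:
        if (s, p) == (subject, predicate):
            found.setdefault(o, set()).add(graph)
    return found


# The original source and a republished, transformed copy disagree on the title.
original = {(POST, DC_TITLE, "Sampling (draft)")}
republished = {(POST, DC_TITLE, "Sampling")}

store = (
    with_provenance(original, "http://example.org/graphs/source")
    | with_provenance(republished, "http://example.org/graphs/resampled")
)

claims = graphs_asserting(store, POST, DC_TITLE)
```

With the graph name kept alongside every statement, the inconsistency is still there, but it is clear who said what.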


Danny Ayers
2007-03-30T10:30:45+02:00
