Before reading any further, please open Shelley's latest post in another tab (or whatever you normally do for a new doc in your favourite browser), have a look, then close it again. Thank you. Nothing unusual about that page, beyond Shelley's usual charming eccentricity.
Now please open this version of the same post in a new tab, click here and there, then close the tab. Now do the same with this - you may find yourself clicking a bit more, but please come back. There's something else to play with in a second.

Bit of background
In my own experience there are various little lights that go on in your head when playing with Semantic Web stuff. For example, even though I'm more of a data person than a document person, it took me a long while to realise that RDF could be used for data rather than just document metadata, or even data about data. Or rather that this distinction didn't really matter.
Another little light would be Syntax matters <blink>not!</blink>.
A particular little light which went on for me only much later was the general idea of the Web side of the Semantic Web. On the one hand I was used to working with data/docs in closed systems (which were often semantic), on the other to browsing the Web. RDF itself fell into place for me in the former pretty quickly; the latter took much, much longer.
A key point here is that links on the Web are data too, and very much the same shape as RDF. A link is a statement of a relation between this page and that page, which the browser can make useful. At the core of RDF is the same idea, generalised into being between this resource (the subject) and that resource (the object), with the relation being typed (the property). It's essentially entity-relation stuff, but with global keys and a protocol (HTTP) for getting more information. We're only now beginning to see tools that can make this generalisation useful. (I wrote up a slightly longer version of this angle in Evolving the Link).
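To make the "same shape" point concrete, here is a minimal Python sketch (not anything from my actual code) of a plain link written down as a subject/property/object statement. The `Triple` type and `link_to_triple` helper are names made up for illustration, and using `dc:relation` for an untyped link is just one reasonable assumption.

```python
from typing import NamedTuple

# Dublin Core "relation" stands in here for an untyped link (an assumption);
# a typed link would use a more specific property URI.
DC_RELATION = "http://purl.org/dc/elements/1.1/relation"


class Triple(NamedTuple):
    """One RDF-style statement (hypothetical helper type)."""
    subject: str    # this resource (the page holding the link)
    predicate: str  # the typed relation (the property)
    obj: str        # that resource (the link target)

    def to_ntriples(self) -> str:
        # Simplified N-Triples output: assumes the URIs need no escaping.
        return f"<{self.subject}> <{self.predicate}> <{self.obj}> ."


def link_to_triple(page_uri: str, href: str,
                   predicate: str = DC_RELATION) -> Triple:
    """Read 'page_uri links to href' as a statement about two resources."""
    return Triple(page_uri, predicate, href)


if __name__ == "__main__":
    t = link_to_triple("http://example.org/this-page",
                       "http://example.org/that-page")
    print(t.to_ntriples())
```

The only thing RDF adds to the browser's view of a link is that the middle term is a URI too, so anyone can look up what the relation is supposed to mean.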
Anyhow, a really old rainy day fun to-do for me was the content-free linked HTML thing as I pointed to above. I've been tempted by it a few times (usually after seeing OPML) but only now got around to it, when it wasn't actually the end in itself.
I've been surrounded by all the GRDDL stuff, and then there's Kanzaki's neat trick, and there've been some other interesting discussions about HTML in RDF. Also it's been wonderful watching more and more big chunks of Web data join the open linked data cloud (the latest bit being Cyc, not sure how many triples).
[tongue approaches cheek]
So, thinks I, why not just make the whole existing Web linked data? Take every single Web page, turn it into RDF. Yeah!
But that assumes the source HTML contains something that can usefully be rendered as RDF. Fine for microformats, or even structured stuff like Wikipedia, but available schemas, and the coder-hours to do each bit of translation, are hard to come by. Then there's the question of how many Googles of storage and cycles of MapReducishness it would take for the whole lot.
But the thing is, pretty much every single page on the Web contains some useful semantic information - links. What things like linked data tools and GRDDL demonstrate is that it's not really necessary to map everything in one fell swoop and dump it in a big DB. You can transform stuff in chunks, or even on the fly as you need it, which also means you don't need to worry about storing the data: you can take the raw material from wherever you find it on the Web at large. Wherever the links might lead you.
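As a rough illustration of the on-the-fly idea, here is a small Python sketch that follows links lazily and stops after a fixed number of pages, rather than mapping everything up front. The `walk_links` function and its `links_of` callback are hypothetical names; in practice `links_of` would fetch a page and extract its links, but here it is fed a tiny in-memory "web" so the sketch runs without a network.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, Tuple


def walk_links(start: str,
               links_of: Callable[[str], Iterable[str]],
               max_pages: int = 10) -> Iterator[Tuple[str, str]]:
    """Lazily yield (page, target) pairs, breadth-first from `start`.

    Nothing is stored beyond the set of URIs already queued; each page is
    only looked at when the consumer asks for more pairs.
    """
    seen = {start}
    queue = deque([start])
    visited = 0
    while queue and visited < max_pages:
        page = queue.popleft()
        visited += 1
        for target in links_of(page):
            yield (page, target)
            if target not in seen:
                seen.add(target)
                queue.append(target)


if __name__ == "__main__":
    # A made-up four-page web standing in for real fetches.
    tiny_web = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["a"]}
    for page, target in walk_links("a", lambda p: tiny_web.get(p, []),
                                   max_pages=2):
        print(page, "->", target)
```

Because it is a generator, the caller decides how far to go: stop consuming and no further pages are touched.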
There's certainly benefit in collecting data, either from existing data sources (as with most of the Linking Open Data material) or from the Web at large, into a single, fairly single-domain-centric store to allow efficient access to the (possibly merged) data - this is the idea behind stores like those of the Talis Platform. Being able to filter the data with queries means that even narrower-domain views of the data are possible. For example, my personal store contains quite a mix of stuff, to which I can apply arbitrary queries, but the only practical application I'm currently using on it is for activity logging. But although this is apparently a localised database, logically this discrete store is a cache of a chunk of the Semantic Web as a whole: linked data, the data of links.
So anyhow, this afternoon I got my nose stuck in PHP (I can just drop that onto one of the Talis servers without thinking about admin). Just like the HTML link extraction to skeletal HTML, I've also done a bit to take any HTML and produce RDF (with fairly time dc:relation connections), and a bit that does the loopback thing to push any linked material back into the transformer (that one makes them rdfs:seeAlsos). Here's the entry point. The process is just HTML Tidy then XSLT.
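For anyone who wants the gist without reading PHP and XSLT, here is a minimal Python sketch of the HTML-to-RDF step. It is not the actual code: it swaps Python's standard `html.parser` in for the Tidy-then-XSLT pipeline, emits N-Triples rather than RDF/XML, and the `LinkCollector` and `html_to_ntriples` names are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

DC_RELATION = "http://purl.org/dc/elements/1.1/relation"


class LinkCollector(HTMLParser):
    """Collect the absolute targets of <a href="..."> elements."""

    def __init__(self, base_uri: str) -> None:
        super().__init__()
        self.base_uri = base_uri
        self.links = []  # absolute URIs in document order, no duplicates

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative references against the page's own URI.
            absolute = urljoin(self.base_uri, href)
            if absolute not in self.links:
                self.links.append(absolute)


def html_to_ntriples(page_uri: str, html: str) -> list:
    """Return one dc:relation statement per distinct link on the page.

    Simplification: URIs are written out as-is, with no N-Triples escaping.
    """
    collector = LinkCollector(page_uri)
    collector.feed(html)
    collector.close()
    return [f"<{page_uri}> <{DC_RELATION}> <{target}> ."
            for target in collector.links]


if __name__ == "__main__":
    sample = '<p><a href="/about">About</a> <a href="http://example.com/x">X</a></p>'
    for line in html_to_ntriples("http://example.org/blog/post", sample):
        print(line)
```

Everything except the links is thrown away, which is the whole (slightly silly) point.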
I had no joy with Tidy in PHP (not installed on that server) or XSLT (no idea why that didn't work, I've got the same chunk of code working elsewhere), so for both I've taken advantage of the W3C's online service endpoint.
If you look at the URI construction, the looping should be reasonably self-explanatory. All the source is in Subversion - most of what's there is test files; the code is really just the one PHP script gluing stuff together, with 4 alternate, fairly trivial XSLTs. An initial HEAD is done to test whether the target is already RDF/XML, in which case it is passed through directly; anything other than that or HTML will likely break.
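The loopback and the HEAD check might look something like the following Python sketch. To be clear about what is assumed: the `TRANSFORMER` address and the `uri` query parameter are placeholders, not the real endpoint, and the function names are invented. The real thing is a PHP script; this only shows the shape of the logic.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical transformer endpoint and parameter name (assumptions).
TRANSFORMER = "http://example.org/html2rdf.php"
RDFS_SEEALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def loopback_uri(target: str) -> str:
    """Wrap a link target so that following it re-enters the transformer."""
    return TRANSFORMER + "?" + urlencode({"uri": target})


def see_also(page_uri: str, target: str) -> str:
    """An rdfs:seeAlso statement pointing at the transformed view of target."""
    return f"<{page_uri}> <{RDFS_SEEALSO}> <{loopback_uri(target)}> ."


def route(content_type: str) -> str:
    """Decide what to do with a target based on its Content-Type header."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/rdf+xml":
        return "passthrough"   # already RDF/XML, hand it straight back
    if media_type in ("text/html", "application/xhtml+xml"):
        return "transform"     # HTML: tidy it and extract the links
    return "unsupported"       # anything else will likely break


def head_content_type(url: str, timeout: float = 10.0) -> str:
    """Issue a HEAD request and return the Content-Type header (or '')."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")


if __name__ == "__main__":
    print(loopback_uri("http://example.com/a?b=1"))
    print(route("text/html; charset=utf-8"))
```

Percent-encoding the target matters here: without it, a target with its own query string would get tangled up with the transformer's parameters.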
The code is rather horrid (I don't know PHP, and it really does seem hard work). There are a lot of obvious improvements, but I've had enough of it for now. There's some oddness, in that certain URIs, like those of Wikipedia, take you back to the HTML Tidy page, and relative URI handling doesn't seem to work either (not hard to fix; PS. fixed). Creating the content-free Web, what an afternoon's work...
PS. Found the immediate cause of the problem with Wikipedia - putting their URIs into Tidy yields 403 Forbiddens. Hmm.
It kind-of works in Tabulator (though the rdf:about URIs and general semantics are seriously questionable), but the look-ahead does seem to get it choked pretty quickly on
'Course my work here isn't done - guess I should set up GRDDL profiles for the stylesheets, though I'd appreciate suggestions on the best terms to use.