Energy

There is now a critical mass of people who know about Web data. Call it semweb, whatever. You can see some frustration bleed out in the language of a Facebook post (alright, I still just want a taxonomy for prostitutes). The ideas of 10 years ago have been fulfilled. We have the Web of Data, in a year or two, the momentum is rolling, different than imagined, but it is here. Now we have to look at other interesting things that might be useful for humanity. I'm not sure, but anonymous access to the Web seems a good idea. Protocol work. Next.


danja
2012-07-04T19:33:56+01:00
semweb rdf

Seki Update

Seki is my little project intended to explore some of the space around the notion of a Linked Data Platform (bit of praxis there, I didn't envision it that way when I started). The W3C have chartered an LDP Working Group, so obviously I'll be watching over there for tie-ins. The approach I'm taking is to build a front end/bridge to a SPARQL 1.1-capable triplestore. So far I've got a rough skeleton down so it can behave essentially as a (very crude) CMS. When I was last looking at the code I hit something of a stumbling block with how best to cover authentication/authorization. On paper it looks like the modeling side of it should be straightforward, though in practice there are a lot of choices and it's not obvious which are better - Bergi (the Bergwinkl one) has been putting some time in on it recently, I reckon I'll just follow his lead. Protocol-wise, I think for now I'll just go with HTTP Basic. Seki uses node.js and I get the sense that it'll be very straightforward to wrap the appropriate parts in HTTPS. (I think when I asked around, Hixie's suggestion was Basic over HTTPS).
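
To make the HTTP Basic part concrete, here's a minimal sketch of what the server side has to do with the Authorization header. It's in Python purely for illustration (it is not Seki's node.js code), the function names are made up for this example, and it assumes the whole exchange already runs over HTTPS, since Basic credentials are only base64-encoded, not encrypted.

```python
import base64
import binascii


def basic_auth_header(username, password):
    """Build the value a client would send in the Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def parse_basic_auth(header_value):
    """Return (username, password) from an Authorization header, or None.

    Hypothetical helper: checking the pair against stored users is left out.
    """
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic" or not encoded:
        return None
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None
    # Only the first colon separates user from password.
    username, sep, password = decoded.partition(":")
    if not sep:
        return None
    return username, password
```

A request without a usable header would get a 401 with a WWW-Authenticate: Basic challenge; how the username/password pair is then checked against the store is exactly the modeling question above and isn't addressed here.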

My intention was once Seki was fairly usable I'd slap it on hyperdata.org, play with it live there. As it happened the DB behind the Wiki I had running there got corrupted, so a couple of days ago I pushed Seki in its place. It's far from what you'd call fully functional yet, but all I needed right away was it to serve static files, and that it's doing admirably.

Once I've got it going properly with basic CMS functionality (with auth), I plan to have a go at hooking in some of the things I saw at the Salzburg workshop the other week - Apache Stanbol, the VIE widgets and associated bits and pieces. The motivation there is in part that those things are just cool stuff, but there's a slightly deeper reason too. Their design is such that they are strongly componentized, with the primary interface everywhere being the Web. Architecturally, IMHO, that has to be the right direction.


danja
2012-06-24T14:45:45+01:00
ldp seki semweb rdf

Three phases of the Semantic Web

The slides I presented at the IKS Workshop are now on slideshare (font messed up a bit, I'll have a go at uploading a pdf version later) and at slides.odp. Probably more useful for a skim are the preparatory notes. I think my main quasi-novel point was that historically the (Semantic) Web could be said to have been through three phases:

1. "It's all about the docs"

the traditional Document Web, with a bit of metadata

2. "No, it's all about the things"

the upper-case Semantic Web, reaching a zenith with Linked Data

3. "Ok, maybe the docs are important after all"

the current phase, not docs exor data but a synthesis of what's gone before - all the Linked Data goodness, what we've learnt about REST, with Web APIs and a variety of media types (like JSON plus JSON-LD), all the smarter CMS stuff with natural language processing bits, the search stuff, bringing in RDFa/microdata/microformats, all together with some gentle relaxation of constraints (think schema.org) - and gaining truly mainstream adoption

Apologies to anyone in Salzburg that followed the link I gave in the slides, I'd totally forgotten that the service there was broken. Just spent this morning setting up a live instance of Seki on hyperdata.org to fix that. Well, kinda live, all it's actually doing now is serving up a handful of static pages and giving the crawlers a 404. There are quite a few things I need to fix up - some thought needed around config and most of all I need to get some auth in place, like yesterday. But having it live is pretty good motivation to get things fixed up.


danja
2012-06-22T13:09:33+01:00
cms iks salzburg semweb rdf

A first taste of the schema.org carbonated soft drink

I recently realised that in my Seki project it made sense to have any exposed HTML include its own description, amongst other reasons to support IKS-flavoured decoupled content management. I'll use RDFa because the mapping to RDF is more straightforward than HTML5 microdata and there's more comprehensive vocab coverage than microformats. But given that I'm exposing this stuff, it also makes sense to have it understandable by as many consumers as possible. Which pretty much means using schema.org vocabularies (straight RDF representations will also be available via conneg, there I might stick to existing well-known vocabs, see note below).

My initial raft of use cases are around having content that's (loosely) blog post-shaped, but even though schema.org has a section for blogging it isn't immediately obvious how to express this. (Now would probably be a good time to revisit AtomOwl, it got left in a very complicated state, Atom-in-Schema.org would tick quite a lot of boxes).

My typical item looks something like:

<http://hyperdata.org/Hello> a sioc:Post ;
    dc:date "2012-04-02T07:24:53.676Z" ;
    dc:title "Hello World!" ;
    sioc:content "My first post." ;
    foaf:maker [ foaf:nick "danja" ] .

Checking at the excellent schema.rdfs.org I found the following mappings pretty quickly:

schema:articleBody owl:equivalentProperty sioc:content .
schema:author owl:equivalentProperty foaf:maker .

sioc:content isn't quite right in my original as that's meant to be plain text, Dave Beckett's planet:content is probably better - it's like the old RSS 1.0 content:encoded except as a more sensible XMLLiteral. articleBody isn't perfect, for my app or for that matter for a lot of RSS/Atom/blogging-like apps. A more generic content would be better (which might be an articleBody, or it might be a description of the link or whatever, more on description in a mo).

Though I found near-enough mappings, the following suffer similar problems:

schema:name rdfs:subPropertyOf dc:title .
schema:datePublished owl:equivalentProperty dc:issued .
schema:Article rdfs:subClassOf sioc:Item .

name is one of those ultra-generic terms alongside title and label, mixed blessing: very easy to work with but don't offer very much information. For my purposes there isn't much to choose between them. datePublished seemed slightly more suitable than dateCreated or dateModified. Here I would have preferred to be able to use a more generic date, further qualifying only when necessary. Again Article is a bit on the specific side, I want to be able to use this for things like a del.icio.us-style bookmark, for this coverage rss:item, sioc:Item and atom:Entry are all a bit closer. Which leaves:

foaf:nick rdfs:subPropertyOf schema:additionalName .

Near enough.

Top-level terms

I think it would be very helpful if schema.org was a bit clearer about "top-level" terms. Right now Thing has description, name, image, url. Ok, not bad as a first pass against what's needed on the Web. But url is/should be redundant (but that's just my semweb prejudices), there's slight conflict between description and content-oriented terms like articleBody which has the intermediate node of Article. (This isn't a new phenomenon, RSS history is littered with the wreckage of content vs. description, and higher up the architectural tree it's one of the features of httpRange-14). Ok, maybe description is useful enough to leave alone, similarly name is probably reasonable to cover the top level of label, title, name. image I suppose is fair enough, a pragmatic approach to something that could easily get messy if more WebArch was brought into the picture. I guess my recommendations then would be to add a term Item (for a generic Information Resource, superclass of Article etc) and date (for a superproperty of all dates).

Automatic mapping

I haven't yet decided whether or not to use the Web vocab or schema.org versions of the terms in my internal RDF, I suppose I could even use both. But my little experience above demonstrates it's not yet obvious how to map across even with these really common terms. If the starting point was something richer, the amount of work involved could easily explode. Some kind of automation is desirable, for the benefit of someone like me in the current situation, a publisher of semantically marked-up HTML that would like their material to connect with the Linked Data Cloud, or someone writing an app that consumes data across different vocabularies. A service (or two) springs to mind: give it a term and it responds with correspondences from other vocabs, or give it a lump of data and let it offer a translation to the preferred vocab(s)/format. There are at least two approaches to implementation: SPARQL CONSTRUCT and/or RDFS/OWL inference (in both cases the use of generic superclasses/properties could be useful). The front end could offer something like the Rich Snippets Testing Tool for authors together with an open API for translation by app developers, to give a leg-up for integration/mashups. It would be nice if the good folks behind schema.org would consider throwing some resources in this direction.
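
As a rough illustration of the inference-flavoured approach, here's a toy sketch in Python. It uses only the property mappings quoted earlier in this post and plain tuples for triples (no RDF library); the function names and the one-step-only entailment are assumptions made for the example, not how a real RDFS/OWL reasoner or a SPARQL CONSTRUCT service would be built.

```python
# Property mappings quoted earlier in the post, as (subject, relation, object).
MAPPINGS = [
    ("schema:articleBody", "owl:equivalentProperty", "sioc:content"),
    ("schema:author", "owl:equivalentProperty", "foaf:maker"),
    ("schema:name", "rdfs:subPropertyOf", "dc:title"),
    ("schema:datePublished", "owl:equivalentProperty", "dc:issued"),
    ("foaf:nick", "rdfs:subPropertyOf", "schema:additionalName"),
]


def entailed_properties(prop):
    """Properties that a triple using `prop` also entails (one step only)."""
    found = set()
    for s, rel, o in MAPPINGS:
        if rel == "owl:equivalentProperty":
            # Equivalence works in both directions.
            if s == prop:
                found.add(o)
            if o == prop:
                found.add(s)
        elif rel == "rdfs:subPropertyOf" and s == prop:
            # Sub-property only entails the super-property, not the reverse.
            found.add(o)
    return found


def translate(triples, target_prefix="schema:"):
    """Re-express triples in the target vocabulary where a mapping allows it."""
    out = []
    for s, p, o in triples:
        for q in sorted(entailed_properties(p)):
            if q.startswith(target_prefix):
                out.append((s, q, o))
    return out


post = "<http://hyperdata.org/Hello>"
triples = [
    (post, "dc:date", "2012-04-02T07:24:53.676Z"),
    (post, "dc:title", "Hello World!"),
    (post, "sioc:content", "My first post."),
    (post, "foaf:maker", "_:b0"),
    ("_:b0", "foaf:nick", "danja"),
]
schema_triples = translate(triples)
```

Run against the typical item, this yields schema:articleBody, schema:author and schema:additionalName triples, but nothing for dc:title or dc:date: schema:name is a sub-property of dc:title, so the entailment only runs the other way. That gap is the kind of thing generic superproperties would help with.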

Comments to G+ please


danja
2012-04-05T15:13:53+01:00
iks seki rdfa html schema.org semantic semweb rdf

AgentRank and serendipity

Agent Rank is described in a Google patent from last year and the implications of it (from a SEO perspective) are discussed in this post. Interesting stuff: essentially factoring people's identities into search particularly through author reputation. Appropriately enough, as evidence that Google is actively working on this a screenshot of Othar Hansson's G+ profile is used (he's listed as Engineering Lead on "The Authorship Project").

This seems a natural progression for them. Another likely example of Google implementing on a large scale ideas that have been floating around semweb activities for a long while. Now they've got the Give yourself a URI bit down (with G+ identities) the rest can follow. The article suggests that the approach will be quite nuanced, incorporating topic information as well. Bravo - anything that makes the soup of the Web more digestible is to be welcomed.

My only concern is a bit on the abstract side. Tim Berners-Lee has often praised the serendipity aspect of the Web, finding things and making connections by (apparent) chance that wouldn't otherwise be obvious. Information reuse is a cornerstone of (Semantic) Web technology, and it's there from the ground up: Roy Fielding says Engineer for serendipity. Whether it's through the uniformity of the interface (as Roy might put it) or of the graph (as Tim might put it), the Web does seem to encourage alignment of resources on similarities, without prejudice.

But any ranking of resources surely has to be done based on known parameters. Serendipity is all about seeing similarities across previously unknown axes. So doesn't AgentRank (and for that matter good old-fashioned PageRank) run counter to this idea?

Here's a recent rather trivial example of serendipity. A couple of days ago I came across this puzzle:

[image: number puzzle]

Now I'm pretty certain this would normally have totally foxed me. But I got the solution in a couple of minutes because the night before I'd been concentrating on this little woodcarving project:

[image: business card wood block]

(Now finished - end results)

I won't give any other clues to the puzzle, but working on one problem gave me direct insight into the other. Ok, the puzzle is an artificial problem, but what if the key to the Riemann Hypothesis lay in a similarly peculiar direction? The information needed could be out there on the Web already, hidden in plain sight on the blog of a mathematician and that of (say) a woodcarver. If it were, it's vaguely plausible that a semweb-style system that combined the data behind the blogs would see the connection. A text-similarity based system might see the connection too. But if the access to the information was based on the mathematician's reputation in mathematics, the woodcarver's reputation in woodcarving and the crossover of these, there would be no serendipity.

I don't know, the infrastructure of the Web supports serendipity, but how do we surface it?

Comments to G+ please


danja
2012-03-31T10:30:49+01:00
authorrank pagerank serendipity semweb rdf agentrank

Lucky SPARQL

tl;dr : how to give SPARQL endpoints an "I'm Feeling Lucky" option and hence support things like WebFinger

Take a query like:

SELECT DISTINCT ?blog WHERE {
   ?person foaf:name "James Snell" .
   ?person foaf:weblog ?blog .
}
LIMIT 1

If I'm asking something like that, then what I'm probably trying to achieve is to get to James' blog. But if I use that on an endpoint, what I'll get back is a bunch of XML (or JSON), from which I'll have to parse out the URI, then fire off another GET. So what about having the endpoint server support an additional parameter, something like:

http://example.org/sparql?query=SELECT+DISTINCT+... &action=redirect

which would tell the server to pull out the URI in the results, and return:

HTTP/1.1 302 Found
Location: http://chmod777self.blogspot.com

- thus taking me straight to my actual target.
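
On the server side the extra step is small. Here's a sketch in Python, assuming the endpoint already has the result set in the standard SPARQL JSON results shape; the function name is made up, and falling back to a 404 when no URI comes back is just one plausible choice.

```python
def lucky_response(results):
    """Turn a SPARQL JSON result set into (status, headers) for action=redirect.

    Returns a 302 pointing at the first URI found in the bindings, or a 404
    if the results contain no URI at all.
    """
    variables = results.get("head", {}).get("vars", [])
    bindings = results.get("results", {}).get("bindings", [])
    for row in bindings:
        for var in variables:
            term = row.get(var)
            if term and term.get("type") == "uri":
                return 302, {"Location": term["value"]}
    return 404, {}


# Example result set for the query above (values as in the post).
example = {
    "head": {"vars": ["blog"]},
    "results": {
        "bindings": [
            {"blog": {"type": "uri", "value": "http://chmod777self.blogspot.com"}}
        ]
    },
}
status, headers = lucky_response(example)
```

Literals and blank nodes are skipped, since there's nothing to redirect to in either case.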

WebFingering

I've had James Snell's proposal for simplifying WebFinger simmering away in the back of my mind. I'm unconvinced by the architectural style of what he suggests (Gopher?), but he does get bonus points for creativity. (See also James' response on that). In the query above I've used foaf:name which is likely to give ambiguous results. But if it was foaf:mbox_sha1sum instead, you've got a mechanism for WebFinger with James' optimization. Ok, the request URI is a bit cumbersome, but templating a short version for special cases like WebFinger would be easy enough.
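
For what it's worth, building the foaf:mbox_sha1sum version of the request is only a few lines. A sketch in Python: the endpoint address and function names are placeholders, action=redirect is the parameter proposed above rather than anything endpoints currently support, and the hash is the SHA1 of the full mailto: URI as FOAF defines it.

```python
import hashlib
from urllib.parse import urlencode


def mbox_sha1sum(email):
    """SHA1 hex digest of the mailto: URI, as used by foaf:mbox_sha1sum."""
    return hashlib.sha1(("mailto:" + email).encode("utf-8")).hexdigest()


def webfinger_query_url(endpoint, email):
    """Build a 'lucky' SPARQL request URL that looks a person up by mailbox hash."""
    query = (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "SELECT DISTINCT ?blog WHERE {\n"
        f'  ?person foaf:mbox_sha1sum "{mbox_sha1sum(email)}" .\n'
        "  ?person foaf:weblog ?blog .\n"
        "}\n"
        "LIMIT 1"
    )
    # action=redirect is the hypothetical extra parameter from this post.
    return endpoint + "?" + urlencode({"query": query, "action": "redirect"})


url = webfinger_query_url("http://example.org/sparql", "someone@example.org")
```

A templated short form would only need to hide the query text and keep the email address (or its hash) as the variable part.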

PS. A better name might be "Optimistic SPARQL" (and probably return a 404 if the query doesn't return a suitable pattern).

Comments to G+ please


danja
2012-03-29T15:51:36+01:00
sparql semweb rdf gopher webfinger

Everyone has a Graph Store

Try this thought experiment.

For practical purposes we often assume that everyone has a computer, a reasonable Internet connection and a modern Web browser. We know it's an inaccurate assumption, but it provides conceptual targets for technology in terms of people and environment.

Ok, now add to that list a Graph Store: a flexible database to which information can easily be added, and which can be easily queried. The data can also be easily shared over the Cloud. The data is available for any applications that might want to use it. The database is schemaless, agnostic about what you put in it: the data could be about contacts, descriptions of people & their relationships (i.e. a Social Graph), it could be about places or events, products, technical information, whatever. It can contain private information, it can contain information that you're happy to share. You control your own store and can let other people access as much or as little of its contents as you like (which they can do easily over the cloud). You can access other people's stores in the same way, according to their preferences. It's both a Personal Knowledgebase and a Federated Public Knowledgebase.

So, make the assumption: everyone has a Graph Store. Now what do you want to do with yours? What can your friends and colleagues do with theirs? How can you use other people's information to improve your quality of life, and vice versa? What new tools can be developed to help them take advantage of their stores? How can you get rich quick on this? What other questions are there..?

Note that if everyone has a Graph Store, for free they automatically get the value-add of the linked data cloud.

Ok, I'm presenting this as a thought experiment, but we pretty much already have all the necessary tools and infrastructure for it to be reality. They aren't generally packaged up in a form that's user-friendly, but that part is becoming increasingly trivial (see below). If you want to run such a store on a local machine there are masses of alternatives - to pick the first three that come to mind there's 4Store, Fuseki and Stardog. If you have a server or other kind of cloudspace available then tools like these are an option there too. For an enterprise kind of environment you probably should look at OpenLink Virtuoso. If you want to leave everything to the cloud, there's Kasabi - note their free hosting option. (I can't remember offhand what other hosted cloud-based options are available, I'm pretty sure there are a few others but a quick search only yielded Dydra which is currently in private beta - please ping me if you know of others...or set up your own :)

The reason I'm prompted to post this now is because of a couple of projects I've had on the go for a while. One (Scute) is an attempt to make my hacking with RDF easier - it's essentially a glorified text editor with a bit of HTTP clientness built in. The other (Seki) was started as a demo more or less to show how a triplestore could be used as a general-purpose read/write Web server, supporting content as well as data. Neither of these is remotely mature enough for proper reuse (Scute has become bloaty/buggy and Seki doesn't do much yet, both are lacking tests and documentation, work in progress innit). But what I found interesting was that although they are approaching semweb tech from very different directions, there's some definite convergence going on. That convergence is more or less around what I was calling the Semantic Web in a Box (SWIB) a few years ago (jeez, 2006 - tempus fuggits).

The thing is, although this Web stuff does evolve gradually over time, there are also developments that are in effect big steps forward. In the context of the Semantic Web there was the publication of the 2004 specs (solidifying the material that came before), the development of SPARQL (allowing loosely-coupled access to triplestores) and the perspective shift that the notion of linked data offers (bringing the Web back into the Semantic Web). That's not to mention the initiatives that have appeared outside semweb cognoscenti circles - things like schema.org.

I reckon SPARQL 1.1 is another big step. Yes, we already knew how to write to the Web with good old RESTful HTTP. But SPARQL Update, Graph Store Protocol etc. offer a standard, loosely-coupled way of writing to triplestores. Ok, a purist may point out that a lot of this stuff isn't RESTful, hence isn't truly Webby. But that doesn't matter - it completes the decoupling of the backend layer (arguably, paradoxically, disintermediating the layers) making it possible to commodify that layer and allow middleware to use generic interfaces, plugging in to any store at one end and potentially any client at the other.
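
To show how little the middleware needs to know about the store, here's a sketch in Python of a write using SPARQL 1.1 Update over the SPARQL protocol: a POST with the update text as the body and the application/sparql-update media type. The endpoint address and the function name are hypothetical, and the request is only built here, not sent.

```python
from urllib.request import Request


def sparql_update_request(update_endpoint, update):
    """Build (but don't send) a SPARQL 1.1 Update request for any compliant store."""
    return Request(
        update_endpoint,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        method="POST",
    )


update = """PREFIX dc: <http://purl.org/dc/elements/1.1/>
INSERT DATA {
  <http://hyperdata.org/Hello> dc:title "Hello World!" .
}"""

# Hypothetical local endpoint; the path depends on the store and its config.
req = sparql_update_request("http://localhost:3030/dataset/update", update)
```

Passing req to urllib.request.urlopen would actually send it. Swapping in a different store should only mean changing the endpoint address, which is the decoupling point being made above.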

This means the SWIB idea just got a whole lot easier. All it needs to be at heart is a triplestore which supports read/write SPARQL. As noted above, these are already available. I do think the packaging could be improved, to totally minimise the installation effort. One click to download, one click to install, another click to run. A bit of shiny GUI is also desirable, not only to make things easier than the default HTML form for endpoint access but also to reduce the surprise to the end user. It should look a lot more like familiar tools - ideally including something general-purpose (think Microsoft Access) and one or two domain-specific apps (FOAFish contacts/social net client is an obvious one, taking advantage of recent developments a Rich Snippets aware bookmarking app might be nice). A little configuration tool would be good to have too, not everyone is comfortable editing exotically-formatted text files.

Of course it would make me very happy if someone else put a SWIB together like this, dear lazyweb, as it'll probably take me another 6 years to get it together myself. But irrespective of what I say or do on the matter the personal/shared graph store is such a gaping niche that it's bound to happen in some form pretty soon anyway. Whatever, the current absence of "everyone has a graph store" is a conceptual block to imagining the possibilities. So try assuming this is already a done deal.

Comments to G+ please


danja
2012-02-26T15:02:58+01:00
swib federated semweb rdf

API Babel

Nothing new here...that's the problem :)

I posted this in a conversation with Nina Jeliazkova and Evan on G+, thought I'd put it here so I could find it again.

Let's say I was setting up an events service for musicians. Following +Evan Prodromou's ref, Portable Contacts would seem in scope for the musicians themselves. Events happen in a location, so one part of the API I'd be interested in is the address stuff. To work with that data it might be useful to use geonames too. It's events, so let me have the place stuff from eventful as well. Geo, geo and geo - with three completely different APIs:
http://portablecontacts.net/draft-spec.html#rfc.section.7.4
http://www.geonames.org/export/web-services.html
http://api.eventful.com/docs/venues/search
The data may be exposed but it's there as a, er, kind of glass silo, it doesn't exactly lend itself to reuse.

Nina remarked:

Not that technically it is impossible to merge the APIs, there is no reason (business, whatever) for them to sit down and merge the APIs. This has happened in network engineering (and other domains) couple of decades ago; there have been many incompatible network protocol/hardware vendors then. It takes time to recognise the value of synchronisation.

She's right, but that time part is an issue. A few years ago everyone was talking about mashups - didn't the value become apparent then? We've had a good modelling language for sync'ing data since (say) 2004 when the RDF specs came out. The data-handling tooling came along with SPARQL in 2008. RESTful good practice ideas have spread widely in the past few years, with linked data I suppose being their counterpart in the semweb world. So why are APIs still so difficult?

Ok, that's glass-half-empty from a semweb perspective. Awareness of this tech has spread. The stuff around Rich Snippets, schema.org and HTML5 microdata demonstrate that the ideas are reaching a wider audience. (Incidentally I was impressed by JeniT's diplomacy about HTML5 in her excellent presentation - but I'm going to start referring to the stuff as HubrisML :)

A personal data point: last week I checked my Twitter "followers" for the first time in maybe 6 months. Around 150 new people. I'd estimate that 100 of them had reference to the Semantic Web (or some closely associated tech) in their profiles. I follow this tech, but still I hardly recognised any of these new folks.

I suspect Mike Amundsen might have a point when he says RDF will languish until it goes hyper (i.e. gains affordances as a hypermedia type). JeniT's talk of using HTML/XML/JSON/RDF for what it's best at probably applies - so how do you bring interactivity to RDF without it looking like it's got a goat's head stuck on its back? Research needed (high on my list). Whatever, pragmatically the linked data API goes a long way.

Anyhow (once I get my bank balance back in the black) I intend to put a lot more effort into actually using this tech to build human-facing apps. I've a few ideas on how to operate as an Indie, a core one being that taking full advantage of what the Web has to offer (i.e. using linked data etc) offers a business advantage, everything else being equal.

(any comments to the thread on G+ please)




danja
2012-02-22T13:28:58+01:00
apis api affordances semweb rdf

Social nets and shared objects

Just checked back on the geek pop video I put up on Tuesday: 111 hits, 4 likes, 1 dislike - heh, satisfactory ratio.

I don't have the energy for advocacy and am not really interested in marketing, but it did get me wondering how you would actually target an audience in this day and age - talking to the right people is efficient communication, right? Clearly folks like Google believe they can target arbitrary demographics with their advertising, identifying the appropriate audience through analysis of user behaviour. Done accurately, it's no longer advertising as such but more about making a connection between some kind of provider and a willing recipient.

In this specific case, the primary target would really be perhaps a person who uses a computer a lot, but only has a minor interest in dev, if any. They probably get most of their desktop software through regular commercial channels, supplemented by dodgy copies of things from their friends. It would be in the interests of this person to know about open source if only in the sense of better software for free. But most of the people reading this will be a hop or two removed from that demographic. Exaggerating for effect, the Open Source Circle has no intersection with the Regular User Circle. How do you find paths through? Ok, maybe there's one that goes [open source user] - [open source geek] - [.net geek] - [MS Windows user]. Yeah, (social) graph problems.
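
As a toy version of that graph problem, here's a breadth-first search in Python over the chain just described. The adjacency data is nothing more than the example path written out as an undirected graph and the function name is made up; a real social graph would obviously be bigger and messier.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest path as a list, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# The example chain from the paragraph above, as an undirected graph.
circles = {
    "open source user": ["open source geek"],
    "open source geek": ["open source user", ".net geek"],
    ".net geek": ["open source geek", "MS Windows user"],
    "MS Windows user": [".net geek"],
}
route = shortest_path(circles, "open source user", "MS Windows user")
```

Finding the path is the easy bit; knowing who sits on which node is the part that needs the data.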

There's potential around communities of interest. Again in this particular case a graphic designer that normally uses Photoshop may be in contact with a Gimp user.

There's an aspect of this I reckon is still really virgin territory, ripe for colonization: I'm sure I've heard better terms but call it "shared objects". My guitar is of generic type Stratocaster, so if someone else has a guitar of generic type Stratocaster there's a very good chance we've got other things in common. It's close to what Amazon already does around recommendations, but I reckon it could be done a whole lot smarter and in a way that's more broadly useful. It's a Semantic Web/Linked Data idea that's also entirely in scope for schema.org and RDFa/microdata work.

Uldis Bojars did some work around the "shared objects" thing a year or two back, I must pester him again for references.

Comments to G+ please


danja
2012-02-17T13:42:18+01:00
federated social semweb rdf graph

Search plus Your World - fool's gold

For quite a while I've held the view that most current approaches to Web search are fundamentally flawed, because the best way to find something is not to lose it in the first place. But as the companies invested in search gradually get smarter in their use of person- and (to a lesser extent) thing-oriented data, rather than just word association (football), search results seem increasingly focused. Google's approach in particular has grown increasingly like the model put forward in the Semantic Web initiative. Recently with G+ we see a big push to capture and exploit data associated with personal profiles (the FOAF domain) and brands (the GoodRelations domain, although maybe there's a role for an additional brand- rather than product-oriented vocab). With Rich Snippets and Schema.org there's a direct use of semweb technology (in a slightly mangled form - One True Ontology is a well-known antipattern to anyone that bothers to look at the literature).

In fact the "Your World" part of Search plus Your World (SPYW) can be seen as a reinvention of the most important part of Semantic Web technology, that of giving everything of significance a URL: people, places, things, concepts. Given that, you can start describing and leveraging relationships between those resources. To use a phrase I think originated around microformats, it's lower-case semantic web. Ok, behind the quality glitz of G+ profiles and pages this seems to have been done in a rather sloppy, ad hoc fashion, but that in itself is fine - whatever it takes. But where Google get it very wrong is by putting themselves at the heart of their system. Not only is semantic in lower-case, so is web. If you do a search with SPYW enabled, you're pointed straight back into the Google Empire. They are making themselves gatekeepers of the Web. Although there aren't any concrete entry barriers to this walled garden, by only signposting Google's footpaths in search results it's creating a system with the same characteristics as say AOL around 2000. From Google search being a vital accessory on the open Web, it's increasingly becoming a portal.

There is already a visible cost in practice to Google's echo chamber - if you want to re-find something one of your colleagues said the other day, sure SPYW is helpful. But if you're trying to do some original research, you don't want to be searching with Your World blinkers on - an engine without those preconceptions such as DuckDuckGo will be more useful.

This strategy I'd assert is doomed to failure for the same reason AOL's walled garden collapsed, to use another phrase I like to repeat, because no matter how big any single entity becomes, the rest of the Web will always be bigger. The focus on the user/Don't Be Evil thing is absolutely right to highlight the value of non-Google resources, although it does fall short by suggesting that the rest of the Web is just a handful of other companies [G+ link] i.e. Twitter, Facebook etc. Google's own long-term survival as a market leader is absolutely dependent on their respect of the Web at large.

So what should Google do? Re-read Steve Yegge's awesome rant [G+ link] for starters. Especially the bits about Platforms. G+ and Your World should be considered in this context - as a semantic (any case) Web (upper case) Platform. For example, Google's pages appear to be aimed at providing the canonical URLs for concepts (...lower-case). But there's already an excellent source of such URLs: Wikipedia. In itself Wikipedia only provides URLs of documents whose primary topic is the thing in question, but DBpedia is a well-established mapping based on best practices from thing identifiers to Wikipedia pages (e.g. <http://dbpedia.org/resource/Berlin> foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/Berlin> . ). If a handful of students from obscure north-European universities (heh, sorry, just for the sake of contrast), with a little community support can create and maintain - give the world - a service supporting all the concepts/things covered by Wikipedia, imagine what the mighty Google could achieve...

To give a little example in the context of Personal Profiles, if I publish my definitive personal profile on my own domain (note Google already understands all the elements of this) then for queries for which "me" is the appropriate response, that page should be the first hit, not my G+ profile.

Another factor in the walled nature of G+ is the limited API. I'm sure features will be added to this in the near future, but I hope (probably unrealistically) they will use proper standards and follow known best practices. Going further into over-optimistic territory, I'll quote Tom Gruber (in an interview talking about how Siri works) :

A site that exposes RDF usually has an API that is easy to deal with, which makes our life easier. For instance, we use geonames.org as one of our geospatial information sources. It is a full-on Semantic Web endpoint, and that makes it easy to deal with. The more the API declares its data model, the more automated we can make our coupling to it.

What should we (as users and components of the Web) do? Well, basically what we're already doing...but trying not to be distracted by shiny things and keeping an eye on the long term - standards are good. When we publish data on the Web we need to consider the quality of the data first (i.e. make it 5 Star), seeing it as purely Google-fodder is missing the point.

Comments please [Google+ link, the irony is not lost on me :)]


danja
2012-01-28T12:59:52+01:00
google semweb rdf spyw

RDF, where art thou

In comments on a post on G+ I said something I might regret:

"There are plenty of RDF-based applications around, but none really have much broad public appeal."

Ade Oshineye responded with "why do you think that is?"

Ok, overnight I remembered there's at least one app (or set of apps if you prefer) that uses RDF and has a lot of adoption: Drupal. According to Wikipedia it's used on at least 1.5% of Web sites worldwide, and has RDF in its core. Then there's data.gov.uk, a public-facing national government site that's RDF through-and-through. I'm a little out of touch, there are no doubt quite a few other good examples of where I'm wrong.

But given that RDF has been around for 5 years*, it's the way of doing data on the Web and virtually every Web-oriented app uses data somewhere, why isn't it ubiquitous?

(* solid specs came out in 2004 although SPARQL wasn't until 2008 so I'm splitting the difference for a rough date for when it became usable)

RDF isn't something that's going to be in your face anyway, so "broad public appeal" is slightly off-target. Developer adoption may be a better key. Whadever.

In terms of it as a database tech, compared to relational DBs (MySQL etc), custom data handling (Twitter uses Ruby message queues), novel DBs (Facebook uses a key-value store Cassandra apparently) RDF stores don't get much of a look-in. Ok, arguably the big scale things need to be custom to hone performance, but why, alongside the Big Data handling, don't we see RDF augmentation?

For consuming apps and desktop apps, I can't actually think of any well-known ones off the top of my head (I think quite a few of the music apps on Linux use librdf under the covers). I don't have a mobile device - any iPhone apps?

What I find a little bizarre (and please give me counter-examples), is that in the areas where RDF really shines - Web-oriented data integration and reuse - there are hardly any well-known apps out there at all, using any technology. There are a handful of feed aggregators and things like techmeme, but the level of integration there is pretty trivial. (Before Kingsley jumps down my throat - OpenLink Virtuoso is seriously good at this kind of stuff out of the box - but what I'm after is where these things are being used by twitter-sized demographics).

There's certainly something to what Lee Feigenbaum said the other day, the wrong question is usually asked, it should be: What can I do with Semantic Web technologies that I wouldn't do otherwise?

In terms of app-building, right now most parts of most things can be built relatively easily using other technologies, so unless the RDF stack is part of the developer's on-hand toolkit (like e.g. LAMP) it won't be first choice. I do suspect that while the false perception that RDF is complex per se isn't so prevalent these days, there's still a notion around that RDF is complex for the benefits it offers. i.e. linked data isn't perceived as a significant value-add, so why bother? The primary objectives can be achieved by pushing around little JSON objects ("jobbies"?) in a fairly arbitrary fashion, so why look further? But data on the Web surely isn't a niche thing...

Feel free to shoot me down in flames from all angles over this one (I'm not interested in advocacy here so don't care if I expose the wrong message) - I also suspect there's still something in the idea that people simply don't get it. While developers seem to have no problem representing pretty much anything in local databases, the idea that anything can be represented on the Web in a similar way hasn't been grasped. I reckon there's good evidence in virtually every high-profile project. Things tend to be focused on HTML (with a little Javascript) and the browser experience. For service-oriented systems the unwritten assumption is that the services will tie into the same view. I'm certainly not saying that this focus is wrong (those user-facing components are vital), just that it can lead to a blinkered view of what is possible. Only relatively recently have developers at large started looking at things like the identity of people on the Web. You still don't see the same attention given to everything else in the world - products, ideas, activities. Ok, you might point to activity streams and the like, but the subject of those activities still largely tends to be doc-oriented: messages or posts. You might point to schema.org and microdata as ways in which people in the Web development community can put data on the Web. But scratch the surface and the main goals underneath are things like SEO, most of the data being expressed is document metadata, not data about the real world. (Next time you go shopping, notice your interactions with the world from finding your car keys onwards, compare and contrast with the Amazon experience.)

The other day I posted a question on G+ that probably should have gone here: All the necessary components were in place for online social networks, in a distributed form, before Facebook & co. came along: blogs, aggregators, the various protocols. So why were Facebook & co. so successful? (got some good comments there, and was very pleased to find out Andreas Kuckartz is researching the question)

The question of data on the Web seems to lie in a similar socio-politico-technical morass. On federation, I'm afraid I'm inclined to agree with Eric Siegel : "I predict decentralization is inevitable, but its very very far away." I feel pretty much the same about the Web of data, though perhaps not so far away (unless I'm confusing small and far away :)

[ooh - a good point on that from Seb Paquet I'd missed before: The folks who grokked decentralization didn't master social experience design and UI design as well as Zuck, and decentralized infrastructure is harder to monetize so getting funding was difficult.]

One final question dedicated to folks on Planet RDF, from danbri in response to (the Facebook re-presentation of) my post yesterday:

If RDF is so great, we should all be rich by now? :)

Another quote, it must have some relevance - via the BBC, from Sir William Preece chief engineer of the British Post Office in 1876: "The Americans have need of the telephone, but we do not. We have plenty of messenger boys."

Still no system here yet, comments to G+ again.


danja
2011-09-17T13:52:14+01:00
federated semweb rdf

Plan B - RDF for fun and profit

Last night, after finding out that part of the G+ API had gone public I skimmed their docs and the docs of some of the specs they draw on: Portable Contacts, Activity Streams and OAuth 2.0. Of course it's great that G+ is exposing an API, and great that they're drawing on existing standards. But after looking at those standards I came away shaking my head, feeling rather discouraged. Again and again they contain data expressed using JSON mappings like "kind": "plus#person" (G+ API) and "objectType" : "person" (Activity Streams) and "" (Portable Contacts assumes that if you've got data you're looking at contacts). Aside from the variation in the naming across these, there's a common theme, the assumption that a simple token (like "person") is adequate for definition of something on the Web. How do you know that their definition of "person" is compatible with your system's definition of "person"? Sure, there are the spec docs to back them up, but how do you get from the data to the spec docs? Ok, there's openness in the publication and dev of these specs and standardization to the extent that they're high-profile enough that vendors like Google will see them and adopt them. But in their technical detail they have more in common with pre-Web, offline proprietary formats - "person" means person because we say so, and everybody knows what we mean.

Digging a bit deeper there's reference to the Discovery Protocol Stack which draws on XRD (the OASIS spec for describing resources) and Web Linking (RFC 5988 for defining typed links). Here there's more of an attempt to make the stuff Web-friendly, entities (resources) and relations (links) are identified with URLs so Web-based discovery of further information is in principle possible. But the "One True Ontology" registry-based approach of Web Linking is questionable in a distributed environment (and comparable to schema.org).

The description of things using schema like "kind": "plus#person" looks like what RDF does, except rather than using a Web-based approach to naming (so you could derive a URL from "plus#person", look it up and find out what it means) instead we see ad hoc token-based naming schemes. With Web Linking we have something that corresponds exactly with RDF properties (they are typed links), and if you can look things up in a registry then that's a step in the right direction. We already use registries to decode the meaning of terms in other major vocabularies - e.g. the HTTP media types through which HTML is delivered lead you to the definitions of terms like "strong" in the relevant specs. But is a registry appropriate for every term we're ever going to use? Does a word like "strong" only have one meaning?
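To make the naming point concrete, here's a minimal Python sketch of handling those tokens "on your own terms". It is only an illustration: the two example.org namespace URLs are invented (neither the G+ API nor Activity Streams publishes them), the function names are hypothetical, and the only real term in it is the FOAF Person URI.

```python
# Hypothetical mapping from ad hoc type tokens to Web-resolvable identifiers.
# The example.org namespaces are invented for illustration; only the FOAF
# URI is a real, published term.
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"

TOKEN_NAMESPACES = {
    "gplus": "http://example.org/gplus/terms/",
    "activitystreams": "http://example.org/activitystreams/terms/",
}

# Local, explicit statements that two terms mean the same thing to *us*.
EQUIVALENT_TO = {
    "http://example.org/gplus/terms/plus#person": FOAF_PERSON,
    "http://example.org/activitystreams/terms/person": FOAF_PERSON,
}


def token_to_uri(source, token):
    """Turn a bare token such as "person" into a URL that could be looked up."""
    return TOKEN_NAMESPACES[source] + token


def canonical_type(source, token):
    """Resolve a token to the term our own system uses, if a mapping is known."""
    uri = token_to_uri(source, token)
    return EQUIVALENT_TO.get(uri, uri)


gplus_type = canonical_type("gplus", "plus#person")
streams_type = canonical_type("activitystreams", "person")
```

With a URL behind each token, the question "is their person my person?" at least becomes something you can record as data rather than as a comment in somebody's code.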

Ok, so far there's a phrase which sums up all this: Cargo Cult RDF

But the theory is that grassroots, use case-driven development will tend to create cowpaths in the environment, and all standards orgs have to do is pave these. Except it doesn't seem to quite work that way. On the one hand we have the XKCD Standards effect (check the first paragraph on the Portable Contacts page), on the other hand the simple fact that, even with the best will in the world and with good information, people often get things wrong. Take for example:

OAuth [1.0] aims to unify the experience and implementation of delegated web service authentication into a single, community-driven protocol.

[time passes]

OAuth 2.0 is a completely new protocol and is not backwards compatible with previous versions....As more sites started using OAuth, especially Twitter, developers realized that the single flow offered by OAuth was very limited and often produced poor user experiences...OAuth 1.0 was largely based on two existing proprietary protocols: Flickr’s API Auth and Google’s AuthSub. The result represented the best solution based on actual implementation experience. (Introducing OAuth 2.0)

So...even when good, informed standardization is aimed for, flawed technologies built with flawed processes are unavoidable.

But these things are so popular! Vendors and developers can't get enough of this kind of stuff. It's a continuous stream: XML APIs become JSON APIs, microformats become microdata, but the same patterns are repeated again and again.

Years of these developments passing RDF by. Plan A : The Semantic Web still seems as far in the future as it did 5, 10 years ago. The RDF technologies demonstrably work, and adoption is growing, but it's hardly viral. However you look at it, the world of trendy new specs repeatedly steers around that fact. What's a jaded RDF enthusiast to do? Here's what I recommend:

Exploit the situation!

With a continuous flow of different specs that each covers some little part of data on the Web, focusing on any specific development can only work in the short term. A strategy based on technologies that support flexibility and agility, using known best practices of the truly distributed Web is the best option in the long term, so that systems can be rapidly adapted to meet any new requirements. It doesn't matter that e.g. schema.org misses the point, the data is still useful. "Think globally, act locally" is a great expression - in this context it could mean accept whatever the world of Web 2.0+ has to offer, but handle it on your own terms.

In practice, let's say you're developing a system for a particular vertical market: dog leads (I'm getting serious hints as I type). Don't build the system from scratch based on what people in the dog lead market are doing, don't tie yourself to domain-specific schema or protocols. Wherever possible use commodity, off-the-shelf tools. Then if dog leads take a nose dive on the international market you can regroup with a different target - cowbells for cats - using the same tools, and same skill set. The only parts that need change are at the edges. Basically RDF technologies offer a long-term commercial advantage.

Comments to G+ please.


danja
2011-09-16T14:31:52+01:00
google streams contacts rant federated web semantic semweb activity rdf portable

RDF Affordances

Short version : An RDF Affordance is a resource description which gives a client all the information it needs to perform an action.

see RdfAffordances and AffordanceVocabulary.

My last post about what a Data Web Browser might look like led to some fertile discussion on G+. Essentially Mike Amundsen neatly reframed the question to being one about affordances, pointing to a bit of related prior work by him on Hypermedia Types.

We hold this truth to be self-evident, that presented with a simple application scenario a Web Architect will abstract it into a form that will take decades to implement.

Only joking...

Web Intents and Actions

I was initially thinking only in terms of an RDF-oriented browser (plugin/service) but it does make sense to stand back and look at the bigger picture. For starters, while RDF is ideal for describing stuff like service characteristics, there's no compelling reason to limit the data that's being manipulated to RDF. With that door open, there's an immediate tie-in with Web Intents, a JSON/Javascript way of describing/implementing generic interactions like share, edit, view, pick etc. (As it happens I added a Web Intents repository to my todo list a few weeks ago, the idea being to store the descriptions as RDF, providing a minimal API for using them in browsers as others have described - nice bit of serendipitous tie-in).

Tantek has spotted the potential around intents and in Web Actions: Identifying A New Building Block For The Web looks at common features across existing systems like Blog this, Digg, Read later, Follow, Like, Share, Tweet, +1 (he uses "Actions" instead of "Intents" for essentially the same idea).

We hold this truth to be self-evident, that presented with the potential for open-ended innovation a Microformats Geek will start paving cowpaths.

Again, joking...

On the Wiki - RdfAffordances - Mike has brought the abstraction back down to ground with some more detail of RDF-oriented actions, and with a view to hacking an implementation (on my virgin node.js installation) I've started a vocabulary - AffordanceVocabulary - this may change fairly soon, apparently Michael Hausenblas has done a vocab in this area, that'll get precedence if there's overlap/conflict.

We hold this truth to be self-evident, that offered a simple application scenario a Semantic Web Geek will always create a vocabulary that obscures the purpose of the application and that no-one will ever use.

Not entirely joking...

There is one high-level abstraction I've noted on that vocab page that is probably useful. There's a natural boundary between affordances that are essentially just HTTP (e.g. click through link, replace a page) and those which require more complex interactions. For now at least I'm calling the former Actions (let me know if there's a better word that doesn't clash with Tantek's usage) - they are around the scope of Mike's Hypermedia Types - and the latter Intents - around the scope of Web Intents.

Comments on G+


danja
2011-08-28T13:52:53+01:00
intents json browser web affordances semweb rdf data

Data-Oriented Web Browser

Not a new idea, but I thought I'd try and find out how far we've got and braindump a little. I'm making the fairly big assumption that a general-purpose data browser would be feasibly useful/usefully feasible in addition to application- or task-specific tools (i.e. use X for your contact/social data, Y for your project management data, Z for your shopping list).

Historically Web browsers provide simple display of (linked) HTML documents obtained via a subset of HTTP, and that's still their primary use. Not very promising for use on the Web of Data without a lot of server-side magic.

But, as well as supporting increasingly sophisticated UI elements, they have built-in support for a Turing-complete language, Javascript. The HTTP limitations can be worked around. So while there may still be potential for a totally new breed of data-oriented Web browsers built from scratch as Rich Internet Applications, current browsers have the potential to do whatever's needed. Although they're pretty much limited to playing a client role, in effect they can be whatever kind of Intelligent Agent you like. The bonus is that everyone's already got a browser on their desktop/tablet/mobile - it's an easy path to deployment either for a plugin or, better still, as code-on-demand.

What's needed for a Data-Oriented Web Browser?

I'm not sure if the Tabulator is still actively maintained (if not, why not!?), but that gave a good indication of the kind of thing that is possible. Taking a step back, the Web of Data is really the same thing as the Semantic Web, and what's new about the Semantic Web isn't the "Semantic" but the "Web" (once again I've lost the source of that quote). How did/do people work with data without the Web? Typically SQL databases and spreadsheets. From those we can lift SQL queries and command-line tools, stored procedures and database forms (this is rather a confession, but back in the day when I first encountered MS Access it blew me away). Then of course there's the spreadsheet UI paradigm, a grid of cells which can be filled with pretty much anything, including most significantly on-the-fly calculated values.

So here's an initial shopping list:

  • an in-memory* graph data structure support (rdfstore-js looks the most advanced right now)
  • a spreadsheet-like view (I bet David Huynh has got stuff like this, if not, how hard could it be with a and jQuery? :)
  • a little language for concisely expressing Web operations, e.g. running SPARQL queries, that could be used inside the spreadsheet (the RDF path-following DSL in Apache Clerezza could be useful here too - link please Henry)
  • tools for building app-specific forms (quite a few tools support custom views of particular classes, e.g. foaf:Person, Fresnel might help here)
  • the ability to write as well as read data (this shouldn't need saying)
  • * persistence would be provided by the Web
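A rough sketch of the first two items on that list, written in Python purely for brevity (in a browser it would be Javascript, e.g. on top of something like rdfstore-js). The Graph class, its method names and the ex:/foaf: sample data are all made up for illustration; none of this is rdfstore-js's API.

```python
class Graph:
    """A tiny in-memory set of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )

    def rows(self, predicates):
        """Spreadsheet-like view: one row per subject, one column per predicate."""
        subjects = sorted({t[0] for t in self.triples})
        table = []
        for subj in subjects:
            row = [subj]
            for pred in predicates:
                values = [t[2] for t in self.match(s=subj, p=pred)]
                row.append(", ".join(values))
            table.append(row)
        return table


# Hypothetical sample data
g = Graph()
g.add("ex:alice", "foaf:name", "Alice")
g.add("ex:alice", "foaf:knows", "ex:bob")
g.add("ex:bob", "foaf:name", "Bob")

table = g.rows(["foaf:name", "foaf:knows"])
```

Each row of `table` is a subject followed by one cell per requested predicate, which is about as far as you can get towards a grid view without deciding what a cell formula should look like.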

    I doubt it's possible to say up front what would be a good user-friendly way of setting this stuff up. But given a bunch of scripts that supported these elements, I reckon with a bit of trial and error dogfood use, within a few iterations something really useful could be possible.

    Thoughts? Volunteers? Startups? :)

    I've still not got commenting set up here so please post any feedback to this Google Plus entry.


    danja
    2011-08-26T10:49:28+01:00
    gui browser ui spreadsheet semweb rdf data linked

    Adding SPICE to the Semantic Web

    Main Course

    Here's a circuit:

    [image: distortion circuit]
    - and here's its SPICE model:

    ***
    .INCLUDE la-components.mod

    Rsrc 1 0 100E3
    Rin 1 2 1E3
    Rfeed1 2 3 10E3
    Q1 3 0 4 BC109
    Q2 3 0 4 BC179
    Rfeed2 3 4 10E3
    Xopamp 0 2 5 6 4 TL071
    Rload 0 4 10E3
    Vcc 5 0 15
    Vee 6 0 -15

    Vsrc 1 0 SIN(0V .1VPEAK 1KHZ)

    .TRAN 10US 1000US
    ***

    The .INCLUDE is as it sounds, the contents of that file are included in this model. After that it's describing a graph with two kinds of nodes: those associated with a component and connection nodes (i.e. common terminals/points/buses/PCB tracks...). Although the components kind-of contain arcs, they're hidden behind the component's connectors. The component's connectors are identified by their position in the space-separated data. On the schematic the nodes are marked in red.

    Taking the first line:

    Rsrc 1 0 100E3

    This is interpreted via :

    (a Resistor) <name> <node1connection> <node2connection> <value>

    Rsrc is a 100k resistor connected between nodes 1 and 0

    (Node/bus 0 is always ground)

    Taking the first of the transistors:

    Q1 3 0 4 BC109

    a transistor of type BC109 called Q1 has its collector connected to node 3, base to node 0, emitter to node 4
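The positional layout described above is easy to pick apart in code. A minimal Python sketch, assuming only the two element kinds discussed here (R and Q) and values written in plain exponent notation like 100E3; the function name and the dictionary keys are my own invention, not anything from ngspice.

```python
def parse_element(line):
    """Parse one SPICE element line into a dict, using the layouts described above."""
    fields = line.split()
    name = fields[0]
    kind = name[0].upper()  # the first letter of the name gives the component type
    if kind == "R":
        # R<name> <node1> <node2> <value>
        return {"type": "Resistor", "name": name,
                "nodes": fields[1:3], "value": float(fields[3])}
    if kind == "Q":
        # Q<name> <collector> <base> <emitter> <model>
        return {"type": "BJT", "name": name,
                "nodes": fields[1:4], "model": fields[4]}
    raise ValueError("unsupported element: " + name)


resistor = parse_element("Rsrc 1 0 100E3")
transistor = parse_element("Q1 3 0 4 BC109")
```

SPICE also accepts scale suffixes on values (10k and friends), which float() won't understand, so a real parser would need a bit more care there.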

    The .TRAN line is used to run a simulation (a transient analysis), sampling every 10uS for 1000uS. I've not really figured out this side of things properly, couldn't get a straight .DC based transfer chart. But the sine wave will do for now.

    Anyhow I can't go looking at a graph model for long without wondering how it could go on the Web. While there are no doubt loads of ways of doing it, the circuit definition can be transcribed into Turtle fairly directly. Bnodes could be used for the connection buses, but it's just as easy to name them. So making things up as I go along -

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix dc: <http://purl.org/dc/elements/1.1/> .
    @prefix spice: <http://purl.org/stuff/spice/> .
    @prefix u: <http://purl.org/stuff/units/> .
    @prefix d: <http://purl.org/stuff/devices/> .
    @base <http://hyperdata.org/circuits/logamp/> .

    <http://hyperdata.org/circuits/logamp> a spice:Circuit ;
    dc:title "Log Amp" ;
    dc:description "a modified log function amplifier" ;
    spice:components ( <Rsrc> <Rin> <Rfeed1> ... <N0> <N1> ...) .

    # Rsrc 1 0 100E3
    <Rsrc> a spice:Resistor ;
    rdfs:label "Rsrc" ;
    spice:terminal1 <N1> ;
    spice:terminal2 <N0> ;
    u:ohms "100000" .

    ...

    that seems ok, now for a transistor:

    # Q1 3 0 4 BC109
    <Q1> a spice:BJT ;
    rdfs:label "Q1" ;
    spice:terminal1 <N3> ;
    spice:terminal2 <N0> ;
    spice:terminal3 <N4> ;
    spice:device d:BC109 .

    that'll do.

    Doing a .INCLUDE in general could really do with something from RDF core (ping RDF WG), but here it's providing other SPICE definitions of the components so it seems reasonable to be more explicit:

    d:BC109 rdfs:isDefinedBy <http://hyperdata.org/circuits/logamp/components#BC109> .

    which given that SPICE supports subcircuits (which is how TL071 is defined) provides a nice composition mechanism.

    I reckon it should be straightforward to write a transformer from SPICE syntax to Turtle. Going the other way, the usual SPARQLing shouldn't be rocket science.
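As a first stab at that transformer, here's a Python sketch covering just the resistor case. It emits the made-up-as-I-go vocabulary from above (spice:Resistor, spice:terminal1/2, u:ohms) and names connection nodes N0, N1... in the same way; the function name is hypothetical, the @prefix lines are left out, and values are assumed to be in plain exponent notation.

```python
def resistor_to_turtle(line):
    """Convert a SPICE resistor line such as 'Rsrc 1 0 100E3' to Turtle."""
    name, node1, node2, value = line.split()
    ohms = int(float(value))  # plain exponent notation only, no k/meg suffixes
    return "\n".join([
        "# " + line,
        "<%s> a spice:Resistor ;" % name,
        '    rdfs:label "%s" ;' % name,
        "    spice:terminal1 <N%s> ;" % node1,
        "    spice:terminal2 <N%s> ;" % node2,
        '    u:ohms "%d" .' % ohms,
    ])


turtle = resistor_to_turtle("Rsrc 1 0 100E3")
```

The other component types would follow the same pattern, one small function per leading letter, with the transistor needing a third terminal and a device reference instead of a value.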

    All seems doable. Homework. Rainy day.

    Starter

    I want to play with analog electronics again, stuff I used to do before the Web came along and ate up my cycles. My motivation now is mostly driven by the price of recording studio equipment. If, for example, I just want to invert the phase of a signal, I'd need to pay say $50+ for a passive DI box or $100+ for a pre-amp. This is a bit demoralising when the components are available for pennies (though hardware like connectors and cases can cost a lot more). Then of course there's the circuit hacking angle, it really is good fun. A project that's a permanent fixture in this space is the distortion pedal (like ghard's Big Muff) - the circuits aren't complicated, but getting a good sound is the Holy Grail, so this is what I'm going to play with first.

    I did buy a bunch of components a while back, but haven't got much in the way of prototyping/test gear. A cheapo USB ADC will hopefully do for a makeshift oscilloscope for now, and I've just ordered the parts to put together a simple PSU (along with *lots* of oddments). But feeling a bit impatient, I thought I'd have a quick look at what software was available these days for circuit simulation.

    I don't know if I'm missing something, but things hardly seem to have progressed at all in the last couple of decades (but then again analog electronics hasn't really changed). The de facto standard is SPICE, and there are quite a few open source tools available for using it (ah, things weren't open source back in the day, that's progress). I won't bother linking to the individual bits, if you look for 'spice' in Synaptic a bunch show up, and they all seem to come under the umbrella of gEDA. Anyhow, after an hour or so's fiddling I was able to draw a little circuit using gschem, but I haven't yet managed to get it to generate a working netlist file (which specifies the inter-component connections for SPICE). I think I just need to sit down and check/add all the component attributes. But that's a bit tedious so I've just been playing with a SPICE file manually. Praise be to text formats.

    The first problem here was finding simulation definitions of the components I want to use. The little circuit I want to test includes a common op-amp (TL071) and a pair of transistors, one NPN (BC 109), one PNP (BC 179). Took a lot of searching, and although (allegedly) many of the manufacturers do provide SPICE modules for their components, I eventually found what I needed on hobbyist sites. (Making the component module files doesn't look too difficult, it'd mostly mean copying values from a spec sheet into a SPICE definition - again, sounds tedious).

    There is a GUI adapter for running simulations, gspiceui (which I must have another look at now I've got a working model), but with the amount of trial and error I was having to do I settled back into the command-line tools. For future ref. it goes like this:

    ngspice <filename>

    This loads in the file and starts up an interactive shell. Took me a long time to figure out what to do next, but here are a couple of bits that worked for me. Once in the shell:

    ngspice 1 -> run

    Runs the simulation (the .TRAN bit). Then:

    ngspice 2 -> plot V(1) V(4)

    Produces a plot like this:

    [image: distortion plot]

    Red is the input (voltage on node 1), blue the output (voltage on node 4).

    Certainly looks like distortion...wonder what it sounds like...

    Pudding

    Finding stuff and looking up references in this space is still fairly Paleolithic, so there's one application of exposing this kind of material as linked (wired!) data. But there are probably stacks of other more inspiring apps. Going totally blue sky, a globally distributed circuit could be rather cool. In the digital realm, for example, you could have a global computer that's built from just a few simulated gates on each of a million interconnected PCs. Bit like an extremely dumbed-down Web service/agent kind of thing.

    In the analog realm it could get very wacky. Host your own local circuit subsystem, connect it to anyone else's. I guess you'd want to connect your inputs to other folks' outputs and offer outputs of your own. As long as you are limited to connecting your inputs to the rest of the world (or more versatile, I can only connect my output to your input if I have the appropriate rights) then subsystems should play nicely with each other. I see no reason why for control and audio signals you couldn't do this in real-time using existing streaming audio protocols (codec'ing locally to PCM for the instantaneous values).

    This is pretty much what I assume some of the net-based recording systems that are around are doing. I must confess I've never looked into these, trying to mimic traditional recording/mixing stuff that way seems a bit of a non-starter because of the latency issues. But flipping it to a messier, slightly bonkers [insert pun about bipolar transistors] global analog synth kind of idea, then it starts to sound more fun.


    danja
    2011-02-23T22:00:24+01:00
    spice turtle electronics semweb rdf

    Some Problems

    Georgi Kobilarov has a refreshing post, suggesting Making Linked Data work isnt the problem. I'm inclined to agree with most of what he says. The technology in itself isn't a solution to any problem, rather an enabler to solve problems. While the idea of serendipity is appealing, it isn't very good justification for a huge global commitment of resources. So what kind of problems do we, as living, social and technological organisms wish to solve?

    To start exploring this space I reckon there are (at least) two general modes of knowledge use. The first is relatively domain-specific, directed by a set of requirements associated with a corresponding set of real-world tasks and operations. These I'd put under the umbrella of Applications, akin to the computer applications we already use but augmented with knowledge engineering facilities and access to the Web of Data. As a shortcut the starting point here is Connolly's Bane: "The bane of my existence is doing things I know the computer could do for me.". But in general it goes far further, in that there are plenty of beneficial things we don't already do. A second mode would be ad hoc, fairly immediate, unplanned, call it Just-in-time problem solving, the kind of thing that we currently turn to search engines for.

    As an example of the Applications mode, one of the early drivers for the Web was e-commerce. I think I'm fairly safe in saying that only the surface of the potential there has been scratched. There's a hint of what can be possible with things like the individual-targeting of Google Ads and Amazon recommendations. In this space the GoodRelations ontology is a marvellous baseline. But what we're not really seeing yet is the whole supply chain from the manufacturer to consumer being integrated. Fairly loosely-coupled (as it is today), in one direction there are the financial aspects ("follow the money"), in the other direction is all the transport, manufacturing and processing that go from raw materials to delivered finished product. Within those different parts of the pipeline there are a whole host of problems relating to technology tied together by human and natural resources.

    Alongside this commercial world there are macroeconomic and macrosocial systems, those areas traditionally covered by government. We're already seeing some movement around transparency with the various government data projects, but I think we're still a very long way from seeing genuinely informed policy and decision making. Reflecting the darker side of advertising right down to commercial spam and taking advantage of general ignorance, good governance is seriously compromised by self-interest (of individuals and corporations) and misinformation. I recently heard a radio programme talking about the UK Conservative Party's successful "Broken Britain" election campaign. An aspect of this was that violent crime was perceived as being on the increase. However the actual statistics suggest that in reality this malaise had actually been declining (see Murder rate lowest for 12 years "Home Office figures show overall crime fell by 5% in England and Wales"). Politicians will always lie, but damage is only done when they get away with it and aren't held to account with the facts. But I don't really want to suggest that prevention of political badness is the goal here, rather the encouragement and facilitation of goodness (man...).

    Another huge area where there are countless problems to solve is science. While the Web has vastly improved information sharing and been a boon to research, I'm not sure the underlying methodologies have changed that much. I'm convinced the open sharing of knowledge at the data level can offer A New Kind of Science (no hyperbole there!).

    There are plenty of other application domains that could benefit from a bit of Web-scale knowledge engineering. Ok, I'll name one more bundle: the Arts.

    Ok, moving on to the Just-in-time mode of problem solving, take a look at the following list (random stuff that came off the top of my head when I woke up this morning). Imagine how you would solve these problems now, and then think how you might solve them with a thousand programmers at your beck and call. Most of them need something considerably deeper than a keyword/linkrank document search. I've dumped this list over on the ESW Wiki, additions and discussion welcome over there (I still haven't implemented comments on this blog, so if you have a comment for anything either mail me or blog it (and mail me) or tweet or use Facebook...).

    • I'd like to upgrade the computer I use for video editing. My budget is about 300 euro. What should I buy?
    • Who should I get to make the soundtrack to my new film?
    • I've bought an Ubuntu laptop to replace my old Apple, I'd like it to run applications that fulfil all the tasks I have on the old machine. What do I need?
    • Should HTML use namespace prefixes?
    • Is there a political motivation behind Royal Weddings?
    • Who should I vote for?
    • Who might make a good (romantic) partner?
    • I wish to sell my double glazing products in sub-Saharan Africa, who should I contact?
    • Who might make a good (business) partner there?
    • I got a mail from someone claiming to be my cousin, asking for a loan. Should I give them the loan?
    • I've got an interesting rash. Should I see a doctor?
    • I wish to enlarge my penis. What method is safe and reliable?

    (Sorry, couldn't resist the last one - but it's a valid example of where you'd need good healthcare data alongside reputation and provenance information)

    PS. danbri points me to a short 1989/90 document which contains a fairly similar list (minus references to genitalia) : Information Management: A Proposal, by a certain Tim Berners-Lee. Go read it. Now!


    danja
    2011-01-22T17:46:57+01:00
    semweb problems rdf

    del.icio.us bookmarks to RDF

    The blogosphere seems to think Yahoo! is going to axe del.icio.us so I've knocked together a quick Python script to get my data out - 2317 occasionally annotated bookmarks. To use: make sure you've got Python first (!), download and install BeautifulSoup (navigate to the dir with setup.py, run python setup.py install), download the script and rename it to souper.py, get your del.icio.us bookmarks and rename the file to delicious.html. Then run python souper.py delicious.html > delicious.ttl and there you have the Turtle.

    I've not checked the output particularly thoroughly, but I think it's ok (one shortcut I made was that any bookmarks that couldn't be converted to ASCII would get ignored). Here's my original bookmarks file and the same data in Turtle (7599 triples).
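
    For anyone who wants to see the shape of such a conversion without installing anything, here is a minimal sketch. It is not souper.py: it uses the standard library's html.parser instead of BeautifulSoup, it assumes the export is the usual Netscape-style bookmark file with HREF and TAGS attributes on each A element, and the choice of dcterms:title and dcterms:subject is my own assumption rather than what the original script emits.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect href, tags and link text from each <A> element."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us
        if tag == "a":
            a = dict(attrs)
            tags = [t for t in (a.get("tags") or "").split(",") if t]
            self._current = {"href": a.get("href") or "", "tags": tags, "title": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.bookmarks.append(self._current)
            self._current = None


def literal(text):
    """Quote a string as a Turtle literal (backslash, quote and newline escaped)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def to_turtle(html):
    """Return Turtle for every bookmark found in the HTML text."""
    parser = BookmarkParser()
    parser.feed(html)
    lines = ["@prefix dcterms: <http://purl.org/dc/terms/> .", ""]
    for b in parser.bookmarks:
        if not b["href"]:
            continue  # nothing to use as a subject
        # Assumption: the href is already a valid IRI (no spaces, no '<' or '>')
        props = ["    dcterms:title " + literal(b["title"].strip())]
        props += ["    dcterms:subject " + literal(t) for t in b["tags"]]
        lines.append("<%s>" % b["href"])
        lines.append(" ;\n".join(props) + " .")
    return "\n".join(lines)


sample = '<DT><A HREF="http://example.org/" TAGS="rdf,semweb">Example "site"</A>'
turtle = to_turtle(sample)
```

    Tags come out as plain string literals here; minting a URI per tag would be the more linked-data-friendly choice, but that needs a naming scheme this sketch doesn't try to guess.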

    The Twitterati seem to be moving en masse to Pinboard, which has a sign-up fee of $7.42 but seems to have got good reviews.


    danja
    2010-12-17T00:24:50+01:00
    script python turtle semweb rdf delicious tags

    Slow Data, Decentralization and Semantic Web Architecture

    [I've still got a bug in my blog software which mangles links, so apologies for the ironically unlinky URIs]

    Slow Food (http://en.wikipedia.org/wiki/Slow_Food) is an international movement founded to offer an alternative to fast food, "it strives to preserve traditional and regional cuisine and encourages farming of plants, seeds and livestock characteristic of the local ecosystem". By a little analogical legerdemain, fast data is the kind of stuff you get from regular search engines - quick but not very nutritious, probably bad for you. Slow Data on the other hand has been harvested with care and with attention paid to its preparation. It's far more satisfying in the long run. While complex Semantic Web systems are currently at a slight disadvantage performance-wise (largely due to their youth), there's no reason that high quality data can't be readily accessible at high speed using existing, well-documented Web techniques. But I'll call it Slow Data anyhow.

    So...I recently got a letter (!) which included a description of a proposed social net application based around RDF data. The author knew what they were talking about and the system sounded good, but they were really struggling with one aspect, how to avoid making a centralised system.

    One of the great rallying cries of the Linked Data movement has been to open data out to the Web. I doubt very much that I've seen a presentation on the subject that hasn't referred to data silos, usually with a predictable image. This antipattern reaches its zenith in applications where the only interface to the data is a dedicated 'snowflake' API (so named because every one is unique), severely limiting the potential for Web-style interconnection (links). Behind the scenes the application implementation may be highly distributed, but all the user or developer can see is a walled garden with a gatekeeper. That's a lot of buzzwords in one paragraph, so I'd better move towards the point.

    How is an RDF triplestore any more open than a SQL-style database hooked up to the Web?
    It might sound heretical, but it isn't, or at least isn't necessarily. The only advantage it has is that by default it uses URIs as identifiers for things (corresponding to the keys in a SQL store) which if designed properly will be dereferenceable over HTTP, i.e. they will be links which can be followed to find out more about the named resources. But SQL-backed Web applications can expose links that can be followed, and many do. (The same goes for NoSQL stores). SPARQL is a query language that can be applied to a particular variety of graphs, but again in itself it isn't really any more webby than the triplestores it addresses. However there is the SPARQL Protocol for RDF (SPROT) http://www.w3.org/TR/rdf-sparql-protocol/ which allows things like an HTTP GET /sparql/?query=EncodedQuery and changes the whole ball game (you don't hear much mention of SPROT, I suppose because of the ugly name and a spec that's mostly WSDL stuff that everyone ignores).
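
    To make the GET form concrete, here's a small sketch of building such a request URL. The endpoint http://example.org/sparql/ is hypothetical, and the function name is just something I made up for illustration; the only part taken from the protocol is the query parameter carrying the URL-encoded query text.

```python
from urllib.parse import urlencode


def sparql_get_url(endpoint, query):
    """Return the URL for a SPARQL Protocol query sent via HTTP GET."""
    # urlencode percent-encodes the query text so it is safe inside a URL
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(
    "http://example.org/sparql/",  # hypothetical endpoint
    "SELECT ?s WHERE { ?s a <http://xmlns.com/foaf/0.1/Person> } LIMIT 10",
)
```

    The resulting URL is an ordinary link: it can be bookmarked, cached and followed like any other, which is the whole point.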

    Hopefully everyone's familiar with Chapter 5 of Fielding's dissertation - http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm - so to cut the waffle I'll cherry-pick one heading: 5.1.4 Cache. If we imagine the Web (of Data) as one huge interlinked information space, then individual stores such as those associated with specific applications can be considered as caches of small chunks of the Web of data. This is probably easiest to conceive by contrasting two different pieces of software. For one let's have a social net app that lets people discover other people with similar interests. It will store data around resources of the type foaf:Person with properties such as foaf:interest and to leverage the social angle foaf:knows. A traditional app for this kind of thing would involve people signing up and entering information about themselves. But quite justifiably a person might say "I don't want to enter loads of stuff into a form in application Y when I already entered it in application X yesterday" (yes, this is the old Data Portability thing). But pause there and for a second piece of software let's have a generic link-follower and data aggregator, i.e. a crawler or bot, or as they're known in FOAF circles, a scutter. It's not difficult to make such things directed, so they only follow specific link types of interest (check Slug http://ldodds.com/projects/slug/ - see also https://github.com/ldodds/slug). Let's make the storage system for this scutter a triplestore. Ok, set the scutter going on the Web at large with a plan to follow foaf:Person related links and slurp the data. Come back a few hours later, and you have an already-populated store to which you can plug in the social app, no need for people to sign up (in an ideal world, and ignoring privacy matters).

    Now the scutter plan for this (i.e. get people data) is pretty much isomorphic to a SPARQL query along the lines of:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    CONSTRUCT { ?s ?p ?o } WHERE {
    ?s rdf:type foaf:Person .
    # separate variables, so an interest need not also be a known person
    ?s foaf:interest ?interest .
    ?s foaf:knows ?friend .
    ?s ?p ?o .
    }

    This is exactly the kind of query you'd also want to be asking in the social net app. Going through the scutter, you're asking the Web at large, but because the data has already been aggregated in your store, it doesn't take a thousand GET requests to find relevant statements. But the statements are exactly the same. In other words, an RDF store is just a cache of a small chunk of the Web of Data.
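
    The scutter side of that equivalence can be sketched in a few lines. This is an illustration under assumptions, not how Slug or any real scutter works: triples are plain tuples of strings, and fetch is a hypothetical callable that returns the triples found at a URI (a real one would do an HTTP GET and parse the RDF). Here it is faked with a dictionary standing in for the Web.

```python
from collections import deque

FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
WANTED = {RDF_TYPE, FOAF + "interest", FOAF + "knows"}  # statements to keep
FOLLOW = {FOAF + "knows"}                               # link types to follow


def scutter(seeds, fetch, max_docs=100):
    """Crawl from the seeds, keep wanted triples, follow foaf:knows only."""
    store, seen, queue = set(), set(), deque(seeds)
    while queue and len(seen) < max_docs:
        uri = queue.popleft()
        if uri in seen:
            continue
        seen.add(uri)
        for s, p, o in fetch(uri):
            if p in WANTED:
                store.add((s, p, o))
            if p in FOLLOW and o not in seen:
                queue.append(o)
    return store


# A pretend Web: URI -> triples found there (all example.org names are made up)
web = {
    "http://example.org/fred": [
        ("http://example.org/fred", RDF_TYPE, FOAF + "Person"),
        ("http://example.org/fred", FOAF + "knows", "http://example.org/wilma"),
        ("http://example.org/fred", FOAF + "nick", "freddy"),
    ],
    "http://example.org/wilma": [
        ("http://example.org/wilma", FOAF + "interest", "http://example.org/topics/rdf"),
    ],
}
store = scutter(["http://example.org/fred"], lambda uri: web.get(uri, []))
```

    The returned set is the cache: the same statements that were out on the (pretend) Web, minus the ones the plan didn't ask for.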

    For performance reasons this kind of cache would be selective in the data collected, so maybe strictly speaking the architecture is more like Uniform Pipe and Filter http://www.ics.uci.edu/~fielding/pubs/dissertation/net_arch_styles.htm#sec_3_2_2 with the uniformity essentially maintained by following the SPARQL and SPROT specs (and 5.1.6 Layered System is probably relevant too).

    This kind of thing is entirely implementable today, in fact the Semantic Web Client Library http://www4.wiwiss.fu-berlin.de/bizer/ng4j/semwebclient/ can do SPARQL queries on the Web at large (SELECT at least, not sure if it supports CONSTRUCT).

    There are other pieces of the Semantic Web toolkit that can be cleanly inserted into Web architecture (as one would hope, given that the Semantic Web is meant to be an extension of the existing Web). For example, a general-purpose WebID setup (FOAF+SSL http://esw.w3.org/WebID) could be inserted between client and server to handle authentication, acting as a proxy and/or gateway.

    Somewhere recently (I think in a paper by danbri and others) I saw discussion about what was needed to get from a Web of Linked Data to a more fully Semantic Web. In other words, even if you score 5 stars at http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/ there might be more you can offer. I might have dreamt it, but I believe the discussion mentioned inference and reasoners. The thing is on the one hand we have lots of linked data already out there, and we already have pretty performant reasoners (e.g. http://clarkparsia.com/pellet/ ) but reasoning over Web-scale data is likely to remain a fantasy. That is, unless you imagine multiple reasoners acting as dedicated, fairly task-specific agents/services over their own manageable little batch of data. These again could be deployed as proxies. For example, another bit of FOAF jargon is smushing, which originally (when people were bnodes) meant the unification of data about a person based on the assumption that the person could be identified by means of their email address or homepage. Since it's more common now to use URIs to identify people (see http://dig.csail.mit.edu/breadcrumbs/node/71) I don't think it's unreasonable to extend the term to cover unification of multiple URIs for a person (typically with owl:sameAs links somewhere). Now going back to the triplestore of the app described above, that's only really interested in statements including the identified foaf:Person, foaf:interest and foaf:knows. There's nothing to stop this treating a person as two individuals if data has been pulled from sites which use their own person ID schemes. But if somewhere else on the Web at large there was a triplestore with reasoning capability that could eat person IDs, foaf:mbox and foaf:homepage data and spit out owl:sameAs statements, this could be used to unify the descriptions for the application. 
    This triplestore could have a scutter as its input and a SPARQL endpoint to provide output, in other words being a uniform pipe kind of proxy.
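
    A real reasoner does far more than this, but the core of smushing is simple enough to sketch by hand. Assumptions: triples are tuples of strings, the identifying properties are just foaf:mbox and foaf:homepage, and the function name is my own. It pairs up subjects that share a value and doesn't bother computing the full transitive closure.

```python
from collections import defaultdict
from itertools import combinations

FOAF = "http://xmlns.com/foaf/0.1/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
# Properties treated as identifying a person uniquely
IDENTIFYING = {FOAF + "mbox", FOAF + "homepage"}


def smush(triples):
    """Return owl:sameAs triples for subjects sharing an identifying value."""
    by_value = defaultdict(set)
    for s, p, o in triples:
        if p in IDENTIFYING:
            by_value[(p, o)].add(s)
    same = set()
    for subjects in by_value.values():
        # sorted() keeps the output stable; each pair is emitted in one direction
        for a, b in combinations(sorted(subjects), 2):
            same.add((a, OWL_SAME_AS, b))
    return same


# Made-up person URIs from two sites with their own ID schemes
data = [
    ("http://a.example/fred", FOAF + "mbox", "mailto:fred@example.org"),
    ("http://b.example/people/42", FOAF + "mbox", "mailto:fred@example.org"),
    ("http://a.example/wilma", FOAF + "mbox", "mailto:wilma@example.org"),
]
links = smush(data)
```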

    Ok, so effectively I'm arguing here that we already have all the bits from which we can glue together a Semantic Web that sits nicely with the Architecture of the World Wide Web http://www.w3.org/TR/webarch/ . But I do think there are at least two specific areas that need attention in the near future. One is in the increased use and optimization for named graphs, especially those of the order of only tens or hundreds of statements. I thought I had a good justification for this, but now my mind's gone blank, so just call it a gut feeling. The other thing is in description of datasets - there's already some stuff around annotations and provenance etc, but I'm thinking more in terms of discovery, with agents/services being able to advertise themselves so that a client that's looking for some particular kind of data can find them. The Vocabulary of Interlinked Datasets (voiD, http://vocab.deri.ie/void) is pretty good in this space, but I reckon we need to go a lot further, and have been mulling over a little quasi-protocol for matchmaking between datasets and agents. I'll post more on that once I've got something to talk about...

    There is a teeny bit of low-cost, potentially invaluable data that it'd be nice to see more of. Let's say a directed scutter has crawled the Web and has aggregated all statements of the form <http://example.org/fred> foaf:interest ?x. While ideally it will be placing the triples it's found into named graphs corresponding to the provenance, a more likely coding scenario (because the queries will get silly with thousands of FROMs - hmm, does SPARQL NG do anything about that?) would be to dump everything into a default graph. But, while the full provenance may not be retained in this setup, it can still be made available to consumers of the data if statements of the form <http://example.org/fred> rdfs:seeAlso <http://wherever.com/source/somedataaboutfred> are added to the store. Call it future-proofing.
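
    In code terms the extra bookkeeping is about one line. A minimal sketch, assuming the store is just a set of tuples standing in for the default graph and that the function name is hypothetical:

```python
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"


def add_with_source(store, triples, source_uri):
    """Add triples to a flat store, plus an rdfs:seeAlso pointer per subject."""
    for s, p, o in triples:
        store.add((s, p, o))
        # Lightweight provenance: where to look for more about this subject
        store.add((s, RDFS_SEE_ALSO, source_uri))
    return store


store = set()
add_with_source(
    store,
    [("http://example.org/fred", "http://xmlns.com/foaf/0.1/interest", "http://example.org/topics/rdf")],
    "http://wherever.com/source/somedataaboutfred",
)
```

    Because the store is a set, crawling the same source twice doesn't pile up duplicate pointers.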


    danja
    2010-12-14T11:08:17+01:00
    architecture arch semweb rdf

    Once more unto the breach (again)

    For the first time in ages I've had a couple of days to sit down and look at code. A lot of it was stuff I hadn't finished, dating back a few years. The typical pattern was either getting distracted from the original aims and playing with the fun stuff or aiming to do so much that I never really got past square one. So this time around I've changed my mind, decided to keep the fun stuff (playing with Agents in Scala) separate from the main app work.

    The main app in mind here is the Semantic Web in a Box idea which I'm back to thinking about in a more minimal form, informed a lot by what Rob wrote on his blog - What people find hard about Linked Data - and the stuff in the Talis tutorial. Basically what I'm after is a very easy-to-use Linked Data editor/visualization tool, with support for some kind of pluggability (TBD). There are existing tools which can do this sort of stuff, but the key here is to keep things as simple as possible (and free and open source). Target users are total beginners and experienced folks that want to be able to knock simple stuff together quickly. There's really not a lot to this, and 'wait long by the river and implementation of your plans will float by' usually works, but no-one really seems to have got around to this thing.

    It'll be a Java/Swing desktop app with the following features:

    • Internal triplestore(s)
    • RDF editor with various views and syntax validation
    • SPARQL editor and results viewer
    • HTTP client (for examining remote resources, crawling and publishing to remote stores/services)
    • HTTP server (for simulating live data)
    • HTTP proxy (for examining headers etc)
    • Basic HTML editor/viewer


    What should also be possible is to run it headless, as a live service.

    Probably more than half the people that read this are likely to have such parts living in their codebases - Java Swing components, Jena, ARQ, and Apache HTTP libs cover an awful lot, the tricky part is wiring them all up in a useful way, with a UI that doesn't confuse.

    I've made a start on gathering together the bits, but I'm unlikely to get down to a good coding session for a while again, so what follows is really notes to self so I don't forget...

    So, RDF editor.

    Currently the main class is org.hyperdata.swing.rdftree.editor.RdfEditor

    One view is a resource-centered thing, based on a JTree backed by a Jena Model. Like everything else here, it's unfinished and very buggy (notably there's something like an out-by-one error on which row expands). But this should give the general idea, the paths should expand indefinitely :

    rdf tree table

    Right now it's only addressing the local model, but it should be reasonably straightforward to hook the HTTP client up to terminal node URIs to go and GET remote data (must check how Tabulator goes about that) and extending the drop-down paths.

    Text views for Turtle and RDF/XML (with crude highlighting from JEditorPanes):

    turtle editor

    xml editor

    I've only just started looking at a graph view (again!), separate from the stuff above - I just hacked at one of the JGraph demos, long way to go:

    The launcher for that is org.hyperdata.swing.graph.danja.GraphEditor


    graph view

    I've stuck the code over here:

    source, wiki etc.


    danja
    2010-11-21T18:47:57+01:00
    swib linkeddata semweb rdf

    Piano Piano

    Where I'm staying at the moment I don't have much time to get on the computer, and net access is really lousy. But I've had a lot of chance to think about stuff that I want to do, and have realised that I can feed a few birds with one bean. The blog engine (this) I've been writing in Scala is approaching the basic level of functionality I wanted, so I'm looking again at a couple of old ideas.

    The first is Semantic Web in a Box (new name needed!), the second an agent-based engine that will support scripting (I did a lightning talk about that at one of the SFSW meetups, must see if I can find the slides). Given that Scala actors are perfect for constructing the kind of agents I have in mind, as well as offering a nice way of doing the SemWeb in a Box stuff, I reckon I'll wrap it all together into one project. And the first application built with this setup can be a refactoring of my blog engine...

    Many of the agents probably won't have all these features, but the stereotypical agent I want, a SemWebAgent, will have the following traits:

    • named with a URI
    • access from a HTTP server
    • access to a HTTP client
    • triplestore


    + some code that'll actually do something useful

    Looking from outside, the things will look like regular Web-accessible resources, and can call/be called by external (RESTful) clients/services etc. Internally, if a particular named resource lies within the same VM then more direct messaging is possible. For scripting (when I get around to it), I've got Jython and Rhino (or equivalents) in mind. To support the pluggability of SemWeb in a Box, I'll go for OSGI, probably using Felix as the container.

    I've started coding up the core actor stuff, which I will fill with unit tests as well - being new to Scala I'll no doubt make a lot of mistakes. I'm also putting together some functional tests for the blog engine, which I'll refactor to use this system. I'm already using a tiny bit of Apache Clerezza (for JAX-RS handling of HTTP calls), and I believe there'll be quite a lot more I can cherry-pick.


    danja
    2010-10-10T10:46:05+01:00
    box clerezza gradino semweb rdf

    Slides from KRDB 2010

    A week or so ago I was up north in Brixen-Bressanone (definitely "a charming town") at the 3rd KRDB school on Trends in the Web of Data. The programme was exceptionally well contrived, IMHO, seriously apposite for what's going on in the Web of Data. In between beers (don't worry, I am sorting that one out) I did the opening session. My initial brief was (I think) "Semantic Web Platforms". Now I could happily have done the obligatory semweb intro and led into material about the Talis Platform (which is still as far as I know the only one I'd consider a true semweb platform, being provided in a Software as a Service manner via HTTP). But Tom was down to talk about Linked Data (slides) and Martin about the GoodRelations ontology (slides), so I assumed that between them most of those bases would be covered.

    In many real senses the Semantic Web is already a done deal, so all this conspired to give me the chance to look at the notion of a platform in general. Naturally I consider the Web of Data to be the key enabler right now, but when it comes to choices on how to use it and application strategies, there I reckon it's worth looking at analogous systems. So I refactored my title to "Platforms and the Semantic Web" and basically spent 2 hours rambling about my hobbies...

    Slides on slideshare and pdf.

    Many thanks to Enrico, Anja et al for the opportunity. I did stay to poke my nose into the SWAP 2010 goings-on, so caught up with quite a few old faces and met a bunch more new ones. Even made it home in one piece.


    danja
    2010-09-27T08:44:16+01:00
    bressanone krdb semweb rdf slides

    Linked Data and Hype

    [in reply to John Sowa on the cg@conceptualgraphs.org list, unfortunately the mail didn't get through - something up with the server]

    I reckon the activities around Linked Data are somewhat different to the typical "Next Big Thing". I'd suggest the NBT here if anything is the Semantic Web, which has suffered from industry hype, and as yet does not live up to the promises. However Linked Data is essentially the same idea as the Semantic Web, but with more emphasis on the "Web" side and less on the "Semantic".

    The central idea of treating the Web conceptually as one big (graph-shaped) database works fine (and the LOD cloud [1] is a notable concrete manifestation), but as you note, most applications do require fast access to relevant data. Some of the more recent RDF stores/SPARQL engines do have performance comparable to traditional RDBs, but I don't think this is entirely relevant to the core paradigm. The tendency in the past has been for the creation of data silos, where each company or organization has their own discrete database. Where data is exposed to the Web it has been in the form of human-readable documents. This makes for a huge impedance mismatch for anyone wishing to use computers to make use of multiple data sources.

    Where data is exposed to the Web as linked data, the material is available for direct recombination and reuse by other parties. When the appropriate standards are used (primarily URIs for identification, RDF for structure and HTTP for transfer) the notion of a database takes on a different form: a triplestore is a (fast) cache of a little chunk of the global Web of data.

    Let's say electricity providers and water providers have their own databases. A company wishing to know where to lay fibre-optic cables would probably want to know where the existing (and planned) wiring/piping lies. Right now that would typically mean they'd need fairly in-depth knowledge of the database schemas and local conventions used by the utility companies. But if the data is available in a consistent form (i.e. RDF) then the work of aligning the source data and extracting the information becomes that much easier. The utilities may still have their own idiosyncratic ways of describing their systems, but then again if they happen to use some common vocabularies (e.g. for geo-location) considerably less expert knowledge of the individual systems is needed to get started. The fibre-optics company could run selective queries (or run a crawler) over the utilities' Web-exposed data, and trivially merge the results in their own, local, performant store.

    The adoption of linked data has to some extent slipped under the radar of industry hype, a good example being http://data.gov.uk, which aims to take (non-personal) UK government data and expose it to the Web in a reusable form. The change in paradigm and increased potential for reuse is pretty apparent when you consider that a lot of the source data is held in Excel spreadsheets or buried in documents. This government-backed project has yielded a couple of surprises. On the one hand there's the willingness of gov departments to hand over their data and help out (the material is technically publicly available already, but in practice that can be far from the case). On the other hand developers have been fairly clamouring to get their hands on the data to build end-user applications.

    (Incidentally, some of the data.gov.uk folks are working on the Linked Data API [2] which provides interfaces to triplestores which don't require any knowledge of RDF or SPARQL, which has traditionally been something of a blocker).


    danja
    2010-08-29T07:00:29+01:00
    linkeddata semweb hype