Debunking the 27 Club with SPARQL

This morning I stumbled across a Fortean Times piece about the "27 Club". The story goes that an awful lot of popular musicians have died at the age of 27. A recent addition to the club is Amy Winehouse, and there was a notable cluster with Brian Jones, Jim Morrison, Jimi Hendrix and Janis Joplin all joining around 1970. The idea of the 27 Club appears to have started soon after Kurt Cobain's death in 1994. Rock mythology being what it is, the origin of the 27 Club is now taken as being bluesman Robert Johnson's pact with the Devil (at the crossroads).

So, is there any truth in this? According to a rock star biographer quoted in Wikipedia, "there is a statistical spike for musicians who die at 27", but a British Medical Journal study showed no such spike. So, contradictory evidence, the jury's still out... But as it happens Wikipedia also has a good collection of data about musicians, and that data is available as processing-friendly linked data from dbPedia. So I thought I'd look into this myself.

Long story short, does the highlighted column here look like a spike?

[Chart: lifespan]

I started by finding the Wikipedia page for Kurt Cobain: https://en.wikipedia.org/wiki/Kurt_Cobain. Given that, it's easy to get dbPedia's identifier for the man: http://dbpedia.org/resource/Kurt_Cobain. Opening Kurt's URI in a browser results in a redirect to a page about him (following the 303 convention): http://dbpedia.org/page/Kurt_Cobain. It displays the pieces of data dbPedia knows about him, the properties and their values. From that I was able to see how the relevant facts were expressed, and translate them to the following triples in Turtle notation:

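PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ont: <http://dbpedia.org/ontology/>
PREFIX db: <http://dbpedia.org/resource/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

db:Kurt_Cobain a ont:MusicalArtist .
db:Kurt_Cobain foaf:name "Kurt Cobain" .
db:Kurt_Cobain ont:birthDate "1967-02-20"^^xsd:date .
db:Kurt_Cobain ont:deathDate "1994-04-05"^^xsd:date .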

This is enough to use as a template for a SPARQL query, putting a variable in place of Kurt's identifier. Given the Robert Johnson story it seems reasonable to filter out any musicians born before the 20th century.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ont: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?name ?birth ?death WHERE {
  ?m a ont:MusicalArtist ;
     foaf:name ?name ;
     ont:birthDate ?birth ;
     ont:deathDate ?death .
  FILTER (?birth > "1900-01-01"^^xsd:date)
}

Running that query produces 4099 results, which seemed small enough to handle in a spreadsheet. Had it been a few more I'd probably have opted for the JSON representation of the results and done the processing with a little script. Had the querying been more complex (likely to cause timeouts on dbPedia) I'd probably have had to do some CONSTRUCT queries to extract the chunks of dbPedia of interest in RDF and put those in a local store, running queries against that. But it wasn't and it wasn't, so I ran the query directly, choosing the XML+XSLT stylesheet option to give me results in HTML. These I simply copied from the browser and pasted into a LibreOffice spreadsheet.

The spreadsheet automatically figured out the date format so I was able to get the musicians' ages with a trivial calculation. Sorting on this column revealed that the first 33 entries were duff data, mostly invalid format. Neither Wikipedia nor dbPedia is perfect. But 4066 values, even allowing for a few errors along the way, should be a big enough sample size to test the theory.
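For anyone who would rather script the age calculation than use a spreadsheet, here is a minimal Python sketch. It is not what was done for this post. The function name `age_at_death` is made up for illustration, and it assumes the dates arrive as ISO `YYYY-MM-DD` strings. Anything that doesn't parse comes back as `None`, which is one way of setting the duff rows aside.

```python
from datetime import date
from typing import Optional


def age_at_death(birth: str, death: str) -> Optional[int]:
    """Return the age in whole years, or None if either date is unusable."""
    try:
        b = date.fromisoformat(birth)
        d = date.fromisoformat(death)
    except (ValueError, TypeError):
        return None  # invalid format, e.g. "1967-00-00"
    if d < b:
        return None  # death before birth: treat as bad data
    # Subtract a year if the birthday had not yet come round in the year of death.
    return d.year - b.year - ((d.month, d.day) < (b.month, b.day))


print(age_at_death("1967-02-20", "1994-04-05"))  # Kurt Cobain's dates from above
```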

The data has another problem - Kurt has two entries. I guess something like Google Refine could be used to tidy this up, but I went with the assumption that such problems would be reasonably evenly distributed. PS. a DISTINCT qualifier in the SELECT clause of the query would be an improvement here, and the ?name bit would be better dropped (it's not needed and introduces duplicates). I used the SNORQL endpoint of dbPedia.
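The PS above can be sketched as follows: the same query with DISTINCT added and ?name dropped, wrapped in a little Python that only builds the request URL (nothing is sent over the network). Several things here are assumptions rather than anything from the original run: the endpoint address, the namespace IRIs, the `format` parameter (the `query` parameter is the standard SPARQL protocol one), and the choice to select `?m` so that two different musicians who happen to share both dates are not merged into one row.

```python
from urllib.parse import urlencode

# Assumed endpoint address; check what the service currently offers.
ENDPOINT = "https://dbpedia.org/sparql"

QUERY = """
PREFIX ont: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?m ?birth ?death WHERE {
  ?m a ont:MusicalArtist ;
     ont:birthDate ?birth ;
     ont:deathDate ?death .
  FILTER (?birth > "1900-01-01"^^xsd:date)
}
"""


def query_url(endpoint: str = ENDPOINT, query: str = QUERY) -> str:
    """Build a GET URL for the query; 'format' is an assumed endpoint option."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


print(query_url())
```

A musician with more than one recorded birth or death date would still appear more than once, so DISTINCT alone is not a complete clean-up.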

Here's an online Google Spreadsheet derived from my LibreOffice original.

So, results. I'll leave the statistical significance measuring to someone else, but to my eyes at least there doesn't seem to be a spike at 27, with only 40 deaths (there's a bigger version of the chart here). If anything, there may be a spike at the top value, 95 deaths at age 74. There may well be a 27 Club of accursed musicians, but the 74 Club is more popular. I don't have the figures for normal humans, but the BMJ found that "musicians in their 20s and 30s were two to three times more likely to die prematurely than the general UK population".
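If the ages were in a script rather than a spreadsheet, the eyeballing could be roughed out like this. It is only a sketch and certainly not a significance test: it counts deaths per age and compares one age with the average of its neighbours. The function names and the `width` parameter are invented for the example, and the sample list is toy data, not the dbPedia results.

```python
from collections import Counter


def death_age_histogram(ages):
    """Count deaths per whole-year age, ignoring missing values."""
    return Counter(a for a in ages if a is not None)


def compare_with_neighbours(hist, age, width=2):
    """Return (count at `age`, mean count over the `width` ages either side)."""
    neighbours = [hist.get(age + k, 0) for k in range(-width, width + 1) if k != 0]
    return hist.get(age, 0), sum(neighbours) / len(neighbours)


# Toy data purely to show the call shape.
toy_ages = [25, 26, 26, 27, 27, 28, 29, 29, None]
hist = death_age_histogram(toy_ages)
print(compare_with_neighbours(hist, 27))  # → (2, 1.5)
```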

Keith Richards is 68.

Comments to G+ please.


danja
2012-03-07T16:21:58+01:00
linkeddata 27club sparql rdf data linked dbpedia journalism

Consolidation

A little follow-up to my post Everyone has a Graph Store. Two main things: looking at those graphs from a different perspective and a little initiative I'm putting forward to try and advance a particular aspect of this stuff. (PS. I've gone on about the first point a lot longer than intended and the dogs need walking, so I'll leave the second thing for another day - in lieu of that check SPARQL Box).

"Graphs" are just Structured Data

Given the response I got on twitter, G+ etc. there must have been something right about that post, but the most interesting feedback I got relates to what was wrong with it. Specifically from Kingsley Idehen (@kidehen):

do the people we need to engage really care about the facts that they've been using 'Graphs' forever? I don't think so. Why not remind them of the fact that they've been working with structured data forever, but in silos prior to the emergence of the ubiquitous Web.

I was bandwagoning the graph meme, in the sense of the Social Graph that's been talked about a lot in recent years, along with things like Tim Berners-Lee's description of the WWW as the Giant Global Graph. I also had in mind the concrete notion of the graph as found in RDF. But Kingsley's absolutely right to point out that what we're talking about here is really just structured data and how we use it.

I'll borrow a little from Kingsley's own history to help clarify the point. Go back two decades and you'll find Kingsley starting a company (which became OpenLink) focused on data integration software. Their products were middleware that allowed connections to be made between various kinds of enterprise databases and applications. They were based on industry standards, allowing pluggability between systems (acronym city: SQL, XML, ODBC, JDBC, OLE, ADO...). Kingsley had recognised there was a market for this stuff because, in essence, being able to connect different systems together significantly increased the value and utility of those systems - the whole being greater than the sum of parts. Fast-forward to say a decade ago, and a new kind of data integration was becoming feasible - using the Web. Rather than using standards designed for connecting specific enterprise tools together, this exploited open, global standards, notably URLs and HTTP. While XML was (and is) useful for this purpose (and HTML also has its uses), the emerging Resource Description Framework has Web technologies as its foundations, so is ideally suited for integrating data in this environment. Seeing the advantages of using not only Web technologies as middleware but also the Web as a database in its own right, Kingsley ensured his company was an early adopter and they've been at the forefront of the development of linked data ever since.

But there's a lot more to this than enterprise databases.

Local Structured Data

Every time we use a computer we are working with structured data. Even if it's just Word documents on a file system, there are relationships and interactions between the pieces of information we're working with. Take a look at your Start Menu or whatever the OS X Toolbar is called: every one of the applications there uses data in a structured fashion. While there will be some system-wide integration of their data, e.g. in allowing intelligent search, essentially each application operates in its own little isolated world.

Back to the Web again and we see all the different companies, services and applications operating in a similar fashion, commonly referred to as data silos. But the take-home here, as Kingsley puts it, is that we've all been using structured data forever. The challenge for the next generation of software, whether we interact with it on our cell phone, laptop, desktop, domestic appliance or the Web, is genuine integration. The best integration capability we have to date is through Web technologies.

Here I'll quote Kingsley again (from G+). He's talking in the context of linked data advocacy, but the point he makes is a much broader, practical one:

Basically, we should be demonstrating 'Linked Data Inside' effects on existing apps (Access, File Maker, Excel, Google Spreadsheet etc..). Here's the pleasant surprise and one of my eternal Linked Data frustrations: each of the native tools above have natural bindings to Linked Data courtesy of:

1. HTTP GET support -- so each Linked Data Resource URL is a Data Source Name, easily comparable to an ODBC/JDBC Data Source Name

2. CSV output support -- meaning to make 3-tuples or 4-tuples and then save to a Text file that's practically N-Triples.

Let's take this opportunity to collectively fix the broken Linked Data narrative. Fixing that will also enable critical fixes to the broken Semantic Web narrative. Everything is a Remix, but Linked Data (the ultimate remix technology) is described or pitched as the ultimate remix facilitator.

More generally, in other words, the future is already here (it's just not very evenly distributed). Referring back to my previous blog post, you can legitimately search & replace "Graph" with "Structured Data".
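Kingsley's second point, that 3-column CSV is practically N-Triples, can be made concrete with a few lines of Python. This is a rough sketch under stated assumptions: each row is subject IRI, predicate IRI, object; an object starting with http:// or https:// is guessed to be an IRI and everything else is treated as a plain literal. The function name `csv_to_ntriples` is invented for the example.

```python
import csv
import io


def csv_to_ntriples(text: str) -> str:
    """Turn 3-column CSV rows into N-Triples lines (crude IRI/literal guess)."""
    lines = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 3:
            continue  # not a 3-tuple, skip it
        s, p, o = (cell.strip() for cell in row)
        if o.startswith(("http://", "https://")):
            obj = f"<{o}>"  # assumption: looks like an IRI, so treat it as one
        else:
            escaped = o.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
            obj = f'"{escaped}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


row = "http://dbpedia.org/resource/Kurt_Cobain,http://xmlns.com/foaf/0.1/name,Kurt Cobain\n"
print(csv_to_ntriples(row))
```

Typed literals, language tags and blank nodes are left out, which is most of the distance between "practically" and "actually" N-Triples.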


danja
2012-03-01T14:52:17+01:00
box kidehen sparql rdf data linked

Knobs

Quantitative Filtering

Filtering is a core feature of information presentation on the Web. As an example, look at blogs. This post will be visible for a while on this blog's front page, along with the other most recent posts. Essentially the page is defined by a filter (by date) applied to a large collection of material. Filters can work over many different axes, e.g. date, tag, author etc. They can be combined to provide faceted views of the information. Filters like this are fairly common on the Web, often seen combined with an ordering of the data, for example: Sort By Price and Show 10 Items Per Page.

Many filters operate over a continuous variable (or one that can be mapped to a continuum), the date of posts being a good example. If you've got a continuous variable then a UI component that becomes available is the knob or slider. It's pretty straightforward to hook such a UI component up to a filter to apply to a backend store (in fact I rigged up a demo of doing this not long ago).

In the context of a blog there is quite a lot of data available that could be used for a knob-controlled filter. For example, most blogs contain a mix of long and short posts. Why not filter on word count? More nuanced things are possible nearby, say link count. Or something like readability. To make knobs or sliders user-friendly you probably wouldn't want to offer the viewer the actual numbers of words or links, but rather e.g. a slider that had at its extremes Short ---|- Long.
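As a rough illustration of the Short...Long slider, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `Post` class, the slider running from 0.0 (Short) to 1.0 (Long), and the `tolerance` that decides how close a post's relative length must be to the slider position. A real version would push the filter down to the backend store rather than loop in memory.

```python
from dataclasses import dataclass


@dataclass
class Post:
    title: str
    body: str


def word_count(post: Post) -> int:
    return len(post.body.split())


def filter_by_length(posts, slider: float, tolerance: float = 0.25):
    """Keep posts whose relative length lies within `tolerance` of the slider.

    slider: 0.0 means Short, 1.0 means Long. The viewer never sees word counts.
    """
    if not posts:
        return []
    counts = [word_count(p) for p in posts]
    shortest, longest = min(counts), max(counts)
    span = (longest - shortest) or 1  # avoid dividing by zero if all are equal
    return [
        p for p, c in zip(posts, counts)
        if abs((c - shortest) / span - slider) <= tolerance
    ]


posts = [Post("note", "one two"), Post("update", "w " * 10), Post("essay", "w " * 100)]
print([p.title for p in filter_by_length(posts, 0.0)])  # → ['note', 'update']
```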

Quantitative Tagging

Sites like Amazon also exploit user-contributed data like rating (in reviews). But there's an awful lot more potential kinds of information available. To pick some at random: utility, creativity, authority, entertainment value (fun!). So someone comes along and sees a post with a set of sliders below it and sets those sliders at 4, 3, 2, 1. That data is passed back to the server. Ideally the server will store that data associated with the user in question, to allow the whole social query dimension. Or the value may simply be aggregated as numbers associated with the post.

When someone else arrives on the site they see, say, a default view of the most recent posts. But below are the same controls again. They are interested in reading material that's useful and fun, but are less interested in the other factors. So they set the sliders at 5, 1, 1, 5 and click Search. They are instantly presented with posts that fit that profile. The user may want to save that profile so it's the default next time they visit.

What happens under the hood at view time is again something that could be quick & dirty less-than/greater-than filtering on the parameters, or something more sophisticated that derived the results from the "shape" of the settings, amplifying the descriptions previously given by their friends.
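One possible middle ground between less-than/greater-than filtering and anything cleverer is to rank posts by how close their aggregated ratings sit to the viewer's slider settings. The sketch below uses plain Euclidean distance. The axis names follow the four examples given earlier, while the post ids and rating values are invented for illustration.

```python
import math

AXES = ("utility", "creativity", "authority", "fun")


def distance(a: dict, b: dict) -> float:
    """Euclidean distance between two slider profiles over the fixed axes."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in AXES))


def rank_posts(ratings: dict, profile: dict) -> list:
    """Return post ids ordered from best to worst match for the profile."""
    return sorted(ratings, key=lambda post_id: distance(ratings[post_id], profile))


# Hypothetical aggregated ratings per post.
ratings = {
    "howto": {"utility": 5, "creativity": 2, "authority": 2, "fun": 4},
    "essay": {"utility": 2, "creativity": 4, "authority": 5, "fun": 1},
}
viewer = {"utility": 5, "creativity": 1, "authority": 1, "fun": 5}
print(rank_posts(ratings, viewer))  # → ['howto', 'essay']
```

Swapping the distance for something like cosine similarity would match on the "shape" of the settings rather than their absolute values.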

Taking the user-contributed data angle a step further, instead of having a predetermined set of controls, it wouldn't be hard (at least if you're using RDF under the hood :) to allow the users to define new sliders: just ask them for the axis over which the slider varies, Foo...Bar. Working title, please change: Wiki Knobs.

Interstices

The applications for knobs like these are pretty open-ended. What I describe above is a typical-Web-site-oriented view of an idea my late wife Caroline suggested years ago, effectively an idea generation machine, working title Interstices. I'm not sure she was a big fan of the Surrealist movement per se, but she loved seeing surprising, apparently contradictory concepts combined in art. I vaguely remember (or have imagined :) her talking about it in the context of a magazine advert for PCs, where the tower cases had Friesian cattle's black and white markings (anyone remember that?).

With Interstices you'd tag images with sliders as described above, arbitrary scales, say Hard...Soft, Natural...Artificial etc. But then rather than looking for matches, the system would offer you opposites, so the PC is Hard...Soft:1, Natural...Artificial:5 whereas the cow is maybe Hard...Soft:4, Natural...Artificial:1.
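A minimal sketch of the "offer opposites" step, using the PC and cow values above. The assumptions are mine: sliders run from 1 to 5, an item's opposite is found by mirroring each value across the scale, an untagged axis counts as the midpoint, and the brick is a made-up third item added so there is a choice to make.

```python
SCALE_MIN, SCALE_MAX = 1, 5  # assumed slider range


def mirror(tags: dict) -> dict:
    """Flip each slider value to the other end of its scale (1 <-> 5, 2 <-> 4)."""
    return {axis: SCALE_MIN + SCALE_MAX - value for axis, value in tags.items()}


def most_opposite(name: str, items: dict) -> str:
    """Return the item whose tags lie closest to the mirror image of `name`'s."""
    target = mirror(items[name])
    midpoint = (SCALE_MIN + SCALE_MAX) / 2  # used when an item lacks an axis

    def gap(other: str) -> float:
        return sum(abs(items[other].get(axis, midpoint) - v) for axis, v in target.items())

    return min((n for n in items if n != name), key=gap)


items = {
    "pc": {"Hard...Soft": 1, "Natural...Artificial": 5},
    "cow": {"Hard...Soft": 4, "Natural...Artificial": 1},
    "brick": {"Hard...Soft": 1, "Natural...Artificial": 4},  # hypothetical extra
}
print(most_opposite("pc", items))  # → cow
```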

I said at the time it would be easy to build - still haven't got around to it. But I'm pretty sure at the time we were only thinking in terms of a little app one or two people might use. Imagine something like this supporting proper crowdsourcing, e.g. sliders attached to Flickr. That'd be cool.

Comments to G+ please


danja
2012-02-12T16:18:07+01:00
sliders ideas interstices knobs social creativity gui filter interaction ui rdf data

Small Data

I'd just like to plant a little flag in the sand. Big Data seems to be the flavour of the month (and is undeniably extremely useful and interesting), but I've a gut feeling that might be symptomatic of not seeing the wood for the trees (or maybe vice versa).

I've not thought this through much, but surely any trends/correlations/relationships that are important enough to be of interest should be detectable without having to build a terabyte+ store? Rather than trying to capture as much raw data as possible up front, I suspect a more productive approach long-term will be to work with (maybe federated) crawler farms, with lots and lots of algorithms running in parallel over what they see. If there are appropriate training feedback loops in place, the shape of the algorithms themselves could be treated as the results of the analysis.

It could be argued that once you have accumulated a corpus of raw data you can subsequently throw whatever you like at it without having to get the raw data again. But that corpus will never be complete or truly fresh - as new data appears on the Web all the time. More critically, under normal circumstances you can never be sure you've got a dataset that contains a good sample representation covering whatever unknowns you're exploring. But crawlers can be directed to favour slices of the Web that contain information relevant to your hypotheses.

So, in the context of the Web, the Web itself should be the only big data needed. Which gives a neat parallel in the other sciences: reality itself is the only database you'll ever need :)

Ok, in the same way that Big Sites (like Wikipedia/dbPedia) add big value to the Web alongside lots of small pieces, loosely joined, the same no doubt goes for Big Data. But let's not forget the vice versa, a complementary Small Data approach.

Somewhat orthogonal to this, one way in which the Web is a game changer for data is that here the relationship between pieces of data (/documents) is at least as significant as those pieces of data stacked on top of each other. Link Rank is a special case, an aggregated, flattened view of link value. If topics and entities (i.e. things in general: people, places, concepts etc.) and their interrelationships are inferred and/or explicitly named, it should expose some interesting facets of how human knowledge works.

Comment to G+ please.


danja
2012-01-30T10:04:06+01:00
algorithms federated ai science rdf data

Scutter's Mate

As I was admiring the Linked Open Vocabularies Endpoint (LOV-E) it occurred to me that the vocabs I maintain (well, create and forget...) aren't particularly discoverable. Even before saying they're vocabs, there's not necessarily anything linking in to them (yes, really forget). Ideally I suppose I should put together a proper Semantic Sitemap, but for now I've thrown together a quick and dirty directory walking script in Python: scutters-mate.py. It produces a Turtle listing of the RDF files it finds (by filename extension) containing entries like this:

<http://hyperdata.org/xmlns/meta.ttl> rdfs:seeAlso <dogmood/index.ttl> .
<dogmood/index.ttl> rdfs:seeAlso <http://hyperdata.org/xmlns/meta.ttl> .
<dogmood/index.ttl> format:format <http://purl.org/stuff/formats/text/turtle> ;
    rdfs:label "text/turtle" .

Here I ran it in the /xmlns directory and saved the output to xmlns/meta.ttl.
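For a flavour of what a directory-walking script of this kind involves, here is a small Python sketch. To be clear, this is not scutters-mate.py itself, just an illustration written under assumptions: the set of filename extensions is a guess, the function names are invented, and only the two rdfs:seeAlso triples are emitted because the `format:` namespace isn't spelled out in the listing above.

```python
import os

# Assumed set of extensions that mark a file as RDF.
RDF_EXTENSIONS = {".ttl", ".rdf", ".nt", ".n3"}


def list_rdf_files(root: str) -> list:
    """Return relative, /-separated paths of RDF-looking files under `root`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in RDF_EXTENSIONS:
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                found.append(rel.replace(os.sep, "/"))
    return sorted(found)


def to_turtle(root: str, index_iri: str) -> str:
    """Emit seeAlso links in both directions between the index and each file."""
    lines = ["@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .", ""]
    for rel in list_rdf_files(root):
        lines.append(f"<{index_iri}> rdfs:seeAlso <{rel}> .")
        lines.append(f"<{rel}> rdfs:seeAlso <{index_iri}> .")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_turtle(".", "http://hyperdata.org/xmlns/meta.ttl"))
```

The relative IRIs resolve against wherever the listing is served from, so the output only makes sense when saved at the root of the directory that was walked.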

I'm thinking I'll also run it from the root of all the domains I use, then try and remember to link to /meta.ttl wherever appropriate to give the scutters a helping hand.

Comments (G+)


danja
2012-01-04T20:16:46+01:00
sitemap scutter vocabs rdf data linked

RDF Affordances

Short version: An RDF Affordance is a resource description which gives a client all the information it needs to perform an action.

See RdfAffordances and AffordanceVocabulary.

My last post about what a Data Web Browser might look like led to some fertile discussion on G+. Essentially Mike Amundsen neatly reframed the question to being one about affordances, pointing to a bit of related prior work by him on Hypermedia Types.

We hold this truth to be self-evident, that presented with a simple application scenario a Web Architect will abstract it into a form that will take decades to implement.

Only joking...

Web Intents and Actions

I was initially thinking only in terms of an RDF-oriented browser (plugin/service) but it does make sense to stand back and look at the bigger picture. For starters, while RDF is ideal for describing stuff like service characteristics, there's no compelling reason to limit the data that's being manipulated to RDF. With that door open, there's an immediate tie-in with Web Intents, a JSON/Javascript way of describing/implementing generic interactions like share, edit, view, pick etc. (As it happens I added a Web Intents repository to my todo list a few weeks ago, the idea being to store the descriptions as RDF, providing a minimal API for using them in browsers as others have described - nice bit of serendipitous tie-in).

Tantek has spotted the potential around intents and in Web Actions: Identifying A New Building Block For The Web looks at common features across existing systems like Blog this, Digg, Read later, Follow, Like, Share, Tweet, +1 (he uses "Actions" instead of "Intents" for essentially the same idea).

We hold this truth to be self-evident, that presented with the potential for open-ended innovation a Microformats Geek will start paving cowpaths.

Again, joking...

On the Wiki - RdfAffordances - Mike has brought the abstraction back down to ground with some more detail of RDF-oriented actions. With a view to hacking an implementation (on my virgin node.js installation) I've started a vocabulary - AffordanceVocabulary. This may change fairly soon: apparently Michael Hausenblas has done a vocab in this area, and that'll get precedence if there's overlap/conflict.

We hold this truth to be self-evident, that offered a simple application scenario a Semantic Web Geek will always create a vocabulary that obscures the purpose of the application and that no-one will ever use.

Not entirely joking...

There is one high-level abstraction I've noted on that vocab page that is probably useful. There's a natural boundary between affordances that are essentially just HTTP (e.g. click through link, replace a page) and those which require more complex interactions. For now at least I'm calling the former Actions (let me know if there's a better word that doesn't clash with Tantek's usage) - they are around the scope of Mike's Hypermedia Types - and the latter Intents - around the scope of Web Intents.

Comments on G+


danja
2011-08-28T13:52:53+01:00
intents json browser web affordances semweb rdf data

Data-Oriented Web Browser

Not a new idea, but I thought I'd try and find out how far we've got and braindump a little. I'm making the fairly big assumption that a general-purpose data browser would be feasibly useful/usefully feasible in addition to application- or task-specific tools (i.e. use X for your contact/social data, Y for your project management data, Z for your shopping list).

Historically Web browsers provide simple display of (linked) HTML documents obtained via a subset of HTTP, and that's still their primary use. Not very promising for use on the Web of Data without a lot of server-side magic.

But, as well as supporting increasingly sophisticated UI elements, they have built-in support for a Turing-complete language, Javascript. The HTTP limitations can be worked around. So while there may still be potential for a totally new breed of data-oriented Web browsers built from scratch as Rich Internet Applications, current browsers have the potential to do whatever's needed. Although they're pretty much limited to playing a client role, in effect they can be whatever kind of Intelligent Agent you like. The bonus is that everyone's already got a browser on their desktop/tablet/mobile - it's an easy path to deployment, either as a plugin or, in better style, as code-on-demand.

What's needed for a Data-Oriented Web Browser?

I'm not sure if the Tabulator is still actively maintained (if not, why not!?), but that gave a good indication of the kind of thing that is possible. Taking a step back, the Web of Data is really the same thing as the Semantic Web, and what's new about the Semantic Web isn't the "Semantic" but the "Web" (once again I've lost the source of that quote). How did/do people work with data without the Web? Typically SQL databases and spreadsheets. From those we can lift SQL queries and command-line tools, stored procedures and database forms (this is rather a confession, but back in the day when I first encountered MS Access it blew me away). Then of course there's the spreadsheet UI paradigm, a grid of cells which can be filled with pretty much anything, including most significantly on-the-fly calculated values.

So here's an initial shopping list:

  • in-memory* graph data structure support (rdfstore-js looks the most advanced right now)
  • a spreadsheet-like view (I bet David Huynh has got stuff like this, if not, how hard could it be with a and jQuery? :)
  • a little language for concisely expressing Web operations, e.g. running SPARQL queries, that could be used inside the spreadsheet (the RDF path-following DSL in Apache Clerezza could be useful here too - link please Henry)
  • tools for building app-specific forms (quite a few tools support custom views of particular classes, e.g. foaf:Person, Fresnel might help here)
  • the ability to write as well as read data (this shouldn't need saying)
* persistence would be provided by the Web

I doubt it's possible to say up front what would be a good user-friendly way of setting this stuff up. But given a bunch of scripts that supported these elements, I reckon with a bit of trial and error dogfood use, within a few iterations something really useful could be possible.

Thoughts? Volunteers? Startups? :)

I've still not got commenting set up here so please post any feedback to this Google Plus entry.


danja
2011-08-26T10:49:28+01:00
gui browser ui spreadsheet semweb rdf data linked