Debunking the 27 Club with SPARQL

This morning I stumbled across a Fortean Times piece about the "27 Club". The story goes that an awful lot of popular musicians have died at the age of 27. A recent new member of the club is Amy Winehouse, and there was a notable cluster with Brian Jones, Jim Morrison, Jimi Hendrix and Janis Joplin all joining around 1970. The idea of the 27 Club appears to have started soon after Kurt Cobain's death in 1994. Rock mythology being what it is, the origin of the 27 Club is now taken as being bluesman Robert Johnson's pact with the Devil (at the crossroads).

So, is there any truth in this? According to a rock star biographer quoted in Wikipedia "there is a statistical spike for musicians who die at 27", but also there's been a British Medical Journal study that showed no such spike. So, contradictory evidence, the jury's still out... But as it happens Wikipedia also has a good collection of data about musicians and that data is available in processing-friendly linked data from dbPedia. So I thought I'd look into this myself.

Long story short, does the highlighted column here look like a spike?

[Chart: lifespan]

I started by finding the Wikipedia page for Kurt Cobain: https://en.wikipedia.org/wiki/Kurt_Cobain. Given that, it's easy to get dbPedia's identifier for the man: http://dbpedia.org/resource/Kurt_Cobain. Opening Kurt's URI in a browser results in a redirect to a page about him (following the 303 convention): http://dbpedia.org/page/Kurt_Cobain. It displays the pieces of data dbPedia knows about him: the properties and their values. From that I was able to see how the relevant facts were expressed, and translate them to the following triples in Turtle notation:

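PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ont: <http://dbpedia.org/ontology/>
PREFIX db: <http://dbpedia.org/resource/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

db:Kurt_Cobain a ont:MusicalArtist .
db:Kurt_Cobain foaf:name "Kurt Cobain" .
db:Kurt_Cobain ont:birthDate "1967-02-20"^^xsd:date .
db:Kurt_Cobain ont:deathDate "1994-04-05"^^xsd:date .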

This is enough to use as a template for a SPARQL query, putting a variable in place of Kurt's identifier. Given the Robert Johnson story it seems reasonable to filter out any musicians born before the 20th century.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ont: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?name ?birth ?death WHERE {
  ?m a ont:MusicalArtist ;
     foaf:name ?name ;
     ont:birthDate ?birth ;
     ont:deathDate ?death .
  FILTER (?birth > "1900-01-01"^^xsd:date)
}

Running that query produces 4099 results, which seemed small enough to handle in a spreadsheet. Had it been a few more I'd probably have opted for the JSON representation of the results and done the processing with a little script. Had the querying been more complex (likely to cause timeouts on dbPedia) I'd probably have had to do some CONSTRUCT queries to extract the chunks of dbPedia of interest in RDF and put those in a local store, running queries against that. But it wasn't and it wasn't, so I ran the query directly, choosing the XML+XSLT stylesheet option to give me results in HTML. These I simply copied from the browser and pasted into a LibreOffice spreadsheet.
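For what it's worth, here is roughly what the "little script" route might have looked like. This is only a sketch: it assumes the results were saved in the standard SPARQL JSON results layout, and the function name and the hand-written sample (using Kurt's dates from above) are mine, purely for illustration.

```python
import json


def rows_from_sparql_json(text):
    """Turn a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # A variable can be missing from a binding; use None in that case.
        rows.append({v: binding[v]["value"] if v in binding else None
                     for v in variables})
    return rows


# Hand-written sample, not real endpoint output.
SAMPLE = """
{"head": {"vars": ["name", "birth", "death"]},
 "results": {"bindings": [
   {"name": {"type": "literal", "value": "Kurt Cobain"},
    "birth": {"type": "literal", "value": "1967-02-20"},
    "death": {"type": "literal", "value": "1994-04-05"}}
 ]}}
"""

rows = rows_from_sparql_json(SAMPLE)
print(rows)
```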

The spreadsheet automatically figured out the date format so I was able to get the musicians' ages with a trivial calculation. Sorting on this column revealed that the first 33 entries were duff data, mostly invalid format. Neither Wikipedia nor dbPedia is perfect. But 4066 values, even allowing for a few errors along the way, should be a big enough sample size to test the theory.
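The same calculation outside a spreadsheet might look something like the sketch below. It assumes the dates arrive as ISO strings (YYYY-MM-DD) and treats anything that won't parse as duff data; the function name is hypothetical, and this is not what the spreadsheet does internally.

```python
from datetime import date


def age_at_death(birth, death):
    """Return completed years between two ISO dates, or None if unusable."""
    try:
        b = date.fromisoformat(birth)
        d = date.fromisoformat(death)
    except (TypeError, ValueError):
        return None  # malformed or missing date: treat as duff data
    if d < b:
        return None  # death before birth is duff data too
    # Knock a year off if the last birthday had not been reached.
    return d.year - b.year - ((d.month, d.day) < (b.month, b.day))


print(age_at_death("1967-02-20", "1994-04-05"))  # Kurt's dates from above
```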

You can see here another problem with the data - Kurt has two entries. I guess something like Google Refine could be used to tidy this up, but I went with the assumption that such problems would be reasonably evenly distributed. PS. a DISTINCT qualifier in the SELECT clause of the query would be an improvement here, and the ?name bit would be better dropped (it's not needed and introduces duplicates). I used the SNORQL endpoint of dbPedia.
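A sketch of what the PS describes, wrapped in a few lines of Python that build a request URL. Assumptions on my part: I've selected ?m in place of ?name so that two different musicians who happen to share birth and death dates aren't collapsed into one row, and the endpoint address and the format parameter are what I'd expect dbPedia's SPARQL endpoint to accept rather than anything checked here. A musician with more than one birth date recorded would still show up more than once.

```python
from urllib.parse import urlencode

# Revised query: DISTINCT added, ?name dropped, ?m selected instead (assumption).
QUERY = """
PREFIX ont: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?m ?birth ?death WHERE {
  ?m a ont:MusicalArtist ;
     ont:birthDate ?birth ;
     ont:deathDate ?death .
  FILTER (?birth > "1900-01-01"^^xsd:date)
}
"""


def query_url(endpoint="http://dbpedia.org/sparql"):
    """Build a GET URL for the query; endpoint and format are assumptions."""
    params = {"query": QUERY, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


print(query_url())
```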

Here's an online Google Spreadsheet derived from my LibreOffice original.

So, results. I'll leave the statistical significance measuring to someone else, but to my eyes at least there doesn't seem to be a spike at 27, with only 40 deaths (there's a bigger version of the chart here). If anything, there may be a spike at the top value, 95 deaths at age 74. There may well be a 27 Club of accursed musicians, but the 74 Club is more popular. I don't have the figures for normal humans, but the BMJ found that "musicians in their 20s and 30s were two to three times more likely to die prematurely than the general UK population".
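If you'd rather not eyeball a chart, the counting can be sketched like this. It is a crude "does this age beat its neighbours" check, not a significance test, and the ages in the usage line are made up for illustration, not taken from the spreadsheet. Both function names are hypothetical.

```python
from collections import Counter


def deaths_by_age(ages):
    """Tally how many people died at each age, ignoring unusable values."""
    return Counter(a for a in ages if a is not None)


def stands_out(counts, age, window=2):
    """True if `age` has strictly more deaths than every age within `window` years."""
    neighbours = [counts[a]
                  for a in range(age - window, age + window + 1)
                  if a != age]
    return all(counts[age] > n for n in neighbours)


# Made-up ages, just to show the calls.
counts = deaths_by_age([26, 27, 27, 28, None, 74])
print(counts.most_common(1), stands_out(counts, 27))
```

A Counter returns 0 for ages that never occur, so the neighbour comparison doesn't need any special-casing for gaps in the data.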

Keith Richards is 68.

Comments to G+ please.


danja
2012-03-07T16:21:58+01:00
linkeddata 27club sparql rdf data linked dbpedia journalism

Consolidation

A little follow-up to my post Everyone has a Graph Store. Two main things: looking at those graphs from a different perspective and a little initiative I'm putting forward to try and advance a particular aspect of this stuff. (PS. I've gone on about the first point a lot longer than intended and the dogs need walking, so I'll leave the second thing for another day - in lieu of that check SPARQL Box).

"Graphs" are just Structured Data

Given the response I got on Twitter, G+ etc. there must have been something right about that post, but the most interesting feedback I got relates to what was wrong with it. Specifically from Kingsley Idehen (@kidehen):

do the people we need to engage really care about the facts that they've been using 'Graphs' forever? I don't think so. Why not remind them of the fact that they've been working with structured data forever, but in silos prior to the emergence of the ubiquitous Web.

I was bandwagoning the graph meme, in the sense of the Social Graph that's been talked about a lot in recent years, along with things like Tim Berners-Lee's description of the WWW as the Giant Global Graph. I also had in mind the concrete notion of the graph as found in RDF. But Kingsley's absolutely right to point out that what we're talking about here is really just structured data and how we use it.

I'll borrow a little from Kingsley's own history to help clarify the point. Go back two decades and you'll find Kingsley starting a company (which became OpenLink) focused on data integration software. Their products were middleware that allowed connections to be made between various kinds of enterprise databases and applications. They were based on industry standards, allowing pluggability between systems (acronym city: SQL, XML, ODBC, JDBC, OLE, ADO...). Kingsley had recognised there was a market for this stuff because, in essence, being able to connect different systems together significantly increased the value and utility of those systems - the whole being greater than the sum of its parts. Fast-forward to, say, a decade ago, and a new kind of data integration was becoming feasible - using the Web. Rather than using standards designed for connecting specific enterprise tools together, this exploited open, global standards, notably URLs and HTTP. While XML was (and is) useful for this purpose (and HTML also has its uses), the emerging Resource Description Framework has Web technologies as its foundations, so is ideally suited for integrating data in this environment. Seeing the advantages of using not only Web technologies as middleware but also the Web as a database in its own right, Kingsley ensured his company was an early adopter and they've been at the forefront of the development of linked data ever since.

But there's a lot more to this than enterprise databases.

Local Structured Data

Every time we use a computer we are working with structured data. Even if it's just Word documents on a file system, there are relationships and interactions between the pieces of information we're working with. Take a look at your Start Menu or whatever the OS X Toolbar is called: every one of the applications there uses data in a structured fashion. While there will be some system-wide integration of their data, e.g. in allowing intelligent search, essentially each application operates in its own little isolated world.

Back to the Web again and we see all the different companies, services and applications operating in a similar fashion, commonly referred to as data silos. But the take-home here, as Kingsley puts it, is that we've all been using structured data forever. The challenge for the next generation of software, whether we interact with it on our cell phone, laptop, desktop, domestic appliance or the Web, is genuine integration. The best integration capability we have to date is through Web technologies.

Here I'll quote Kingsley again (from G+). He's talking in the context of linked data advocacy, but the point he makes is a much broader, practical one:

Basically, we should be demonstrating 'Linked Data Inside' effects on existing apps (Access, File Maker, Excel, Google Spreadsheet etc..). Here's the pleasant surprise and one of my eternal Linked Data frustrations: each of the native tools above have natural bindings to Linked Data courtesy of:

1. HTTP GET support -- so each Linked Data Resource URL is a Data Source Name, easily comparable to an ODBC/JDBC Data Source Name

2. CSV output support -- meaning to make 3-tuples or 4-tuples and then save to a Text file that's practically N-Triples.

Let's take this opportunity to collectively fix the broken Linked Data narrative. Fixing that will also enable critical fixes to the broken Semantic Web narrative. Everything is a Remix, but Linked Data (the ultimate remix technology) is described or pitched as the ultimate remix facilitator.

More generally, in other words, the future is already here (it's just not very evenly distributed). Referring back to my previous blog post, you can legitimately search & replace "Graph" with "Structured Data".
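Kingsley's second point, that rows of 3-tuples saved as text are practically N-Triples, is easy to sketch. The version below makes some loud assumptions of my own: every row has exactly three cells, the first two are IRIs, and the third is an IRI only if it starts with http:// or https:// (anything else becomes a plain literal). The function names are hypothetical.

```python
import csv
import io


def term(value):
    """Serialise one CSV cell as an N-Triples object term."""
    if value.startswith(("http://", "https://")):
        return "<" + value + ">"  # assumption: looks like an IRI, so treat it as one
    escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
                    .replace("\n", "\\n").replace("\r", "\\r"))
    return '"' + escaped + '"'


def csv_to_ntriples(text):
    """Convert CSV rows of subject,predicate,object into N-Triples lines."""
    lines = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        s, p, o = row
        lines.append("<%s> <%s> %s ." % (s, p, term(o)))
    return "\n".join(lines)


SAMPLE = "http://dbpedia.org/resource/Kurt_Cobain,http://xmlns.com/foaf/0.1/name,Kurt Cobain\n"
print(csv_to_ntriples(SAMPLE))
```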


danja
2012-03-01T14:52:17+01:00
box kidehen sparql rdf data linked

Scutter's Mate

As I was admiring the Linked Open Vocabularies Endpoint (LOV-E) it occurred to me that the vocabs I maintain (well, create and forget...) aren't particularly discoverable. Even before saying they're vocabs, there's not necessarily anything linking in to them (yes, really forget). Ideally I suppose I should put together a proper Semantic Sitemap, but for now I've thrown together a quick and dirty directory walking script in Python: scutters-mate.py. It produces a Turtle listing of the RDF files it finds (by filename extension) containing entries like this:

<http://hyperdata.org/xmlns/meta.ttl>  rdfs:seeAlso <dogmood/index.ttl> .
<dogmood/index.ttl> rdfs:seeAlso <http://hyperdata.org/xmlns/meta.ttl> .
<dogmood/index.ttl> format:format <http://purl.org/stuff/formats/text/turtle> ;
rdfs:label "text/turtle" .

Here I ran it in the /xmlns directory and saved the output to xmlns/meta.ttl.

I'm thinking I'll also run it from the root of all the domains I use, then try and remember to link to /meta.ttl wherever appropriate to give the scutters a helping hand.
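For anyone wanting to roll their own, the general idea can be sketched as below. To be clear, this is not scutters-mate.py itself: the extension list and the function names are my assumptions, and the format:format lines from the listing above are left out.

```python
import os

# Assumption: which filename extensions count as RDF.
RDF_EXTENSIONS = {".ttl", ".rdf", ".n3", ".nt", ".owl"}


def find_rdf_files(root):
    """Walk `root` and return the relative paths of files that look like RDF."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if os.path.splitext(filename)[1].lower() in RDF_EXTENSIONS:
                full = os.path.join(dirpath, filename)
                # Use forward slashes so the paths work as relative IRIs.
                found.append(os.path.relpath(full, root).replace(os.sep, "/"))
    return sorted(found)


def seealso_listing(paths, meta_uri):
    """Produce Turtle linking the meta file and each RDF file in both directions."""
    lines = ["@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> ."]
    for path in paths:
        lines.append("<%s> rdfs:seeAlso <%s> ." % (meta_uri, path))
        lines.append("<%s> rdfs:seeAlso <%s> ." % (path, meta_uri))
    return "\n".join(lines)


print(seealso_listing(find_rdf_files("."), "http://hyperdata.org/xmlns/meta.ttl"))
```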

Comments (G+)


danja
2012-01-04T20:16:46+01:00
sitemap scutter vocabs rdf data linked

Data-Oriented Web Browser

Not a new idea, but I thought I'd try and find out how far we've got and braindump a little. I'm making the fairly big assumption that a general-purpose data browser would be feasibly useful/usefully feasible in addition to application- or task-specific tools (i.e. use X for your contact/social data, Y for your project management data, Z for your shopping list).

Historically Web browsers provide simple display of (linked) HTML documents obtained via a subset of HTTP, and that's still their primary use. Not very promising for use on the Web of Data without a lot of server-side magic.

But, as well as supporting increasingly sophisticated UI elements, they have built-in support for a Turing-complete language, JavaScript. The HTTP limitations can be worked around. So while there may still be potential for a totally new breed of data-oriented Web browsers built from scratch as Rich Internet Applications, current browsers have the potential to do whatever's needed. Although they're pretty much limited to playing a client role, in effect they can be whatever kind of Intelligent Agent you like. The bonus is that everyone's already got a browser on their desktop/tablet/mobile - it's an easy path to deployment, either as a plugin or, in better style, as code-on-demand.

What's needed for a Data-Oriented Web Browser?

I'm not sure if the Tabulator is still actively maintained (if not, why not!?), but that gave a good indication of the kind of thing that is possible. Taking a step back, the Web of Data is really the same thing as the Semantic Web, and what's new about the Semantic Web isn't the "Semantic" but the "Web" (once again I've lost the source of that quote). How did/do people work with data without the Web? Typically SQL databases and spreadsheets. From those we can lift SQL queries and command-line tools, stored procedures and database forms (this is rather a confession, but back in the day when I first encountered MS Access it blew me away). Then of course there's the spreadsheet UI paradigm, a grid of cells which can be filled with pretty much anything, including most significantly on-the-fly calculated values.

So here's an initial shopping list:

  • in-memory* graph data structure support (rdfstore-js looks the most advanced right now)
  • a spreadsheet-like view (I bet David Huynh has got stuff like this, if not, how hard could it be with a and jQuery? :)
  • a little language for concisely expressing Web operations, e.g. running SPARQL queries, that could be used inside the spreadsheet (the RDF path-following DSL in Apache Clerezza could be useful here too - link please Henry)
  • tools for building app-specific forms (quite a few tools support custom views of particular classes, e.g. foaf:Person, Fresnel might help here)
  • the ability to write as well as read data (this shouldn't need saying)

    * persistence would be provided by the Web
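    To make the first two items a bit more concrete, here's a toy sketch. It's in Python rather than the JavaScript a browser would actually run, the class and function names are invented for illustration, and it has nothing to do with how rdfstore-js works internally.

```python
class Graph:
    """A toy in-memory triple store with wildcard pattern matching."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Yield triples matching the pattern; None acts as a wildcard."""
        for triple in self.triples:
            if all(want is None or want == got
                   for want, got in zip((s, p, o), triple)):
                yield triple


def as_grid(graph, subjects, predicates):
    """Spreadsheet-like view: one row per subject, one column per predicate."""
    grid = []
    for s in subjects:
        row = []
        for p in predicates:
            values = sorted(o for _, _, o in graph.match(s, p))
            row.append(", ".join(values))  # empty cell if nothing is known
        grid.append(row)
    return grid


g = Graph()
g.add("db:Kurt_Cobain", "foaf:name", "Kurt Cobain")
g.add("db:Kurt_Cobain", "ont:birthDate", "1967-02-20")
print(as_grid(g, ["db:Kurt_Cobain"], ["foaf:name", "ont:birthDate", "ont:deathDate"]))
```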

    I doubt it's possible to say up front what would be a good user-friendly way of setting this stuff up. But given a bunch of scripts that supported these elements, I reckon with a bit of trial and error dogfood use, within a few iterations something really useful could be possible.

    Thoughts? Volunteers? Startups? :)

    I've still not got commenting set up here so please post any feedback to this Google Plus entry.


    danja
    2011-08-26T10:49:28+01:00
    gui browser ui spreadsheet semweb rdf data linked