Debunking the 27 Club with SPARQL

This morning I stumbled across a Fortean Times piece about the "27 Club". The story goes that an awful lot of popular musicians have died at the age of 27. A recent member of the club is Amy Winehouse, and there was a notable cluster with Brian Jones, Jim Morrison, Jimi Hendrix and Janis Joplin all joining around 1970. The idea of the 27 Club appears to have started soon after Kurt Cobain's death in 1994. Rock mythology being what it is, the origin of the 27 Club is now taken to be bluesman Robert Johnson's pact with the Devil (at the crossroads).

So, is there any truth in this? According to a rock star biographer quoted in Wikipedia, "there is a statistical spike for musicians who die at 27", but there has also been a British Medical Journal study that showed no such spike. So, contradictory evidence, the jury's still out... But as it happens Wikipedia also has a good collection of data about musicians, and that data is available as processing-friendly linked data from dbPedia. So I thought I'd look into this myself.

Long story short, does the highlighted column here look like a spike?

[Image: lifespan]

I started by finding the Wikipedia page for Kurt Cobain: https://en.wikipedia.org/wiki/Kurt_Cobain. Given that, it's easy to get dbPedia's identifier for the man: http://dbpedia.org/resource/Kurt_Cobain. Opening Kurt's URI in a browser results in a redirect to a page about him (following the 303 convention): http://dbpedia.org/page/Kurt_Cobain. It displays the pieces of data dbPedia knows about him, as properties and their values. From that I was able to see how the relevant facts were expressed, and translate them to the following triples in Turtle notation:

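PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ont: <http://dbpedia.org/ontology/>
PREFIX db: <http://dbpedia.org/resource/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

db:Kurt_Cobain a ont:MusicalArtist .
db:Kurt_Cobain foaf:name "Kurt Cobain" .
db:Kurt_Cobain ont:birthDate "1967-02-20"^^xsd:date .
db:Kurt_Cobain ont:deathDate "1994-04-05"^^xsd:date .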

This is enough to use as a template for a SPARQL query, putting a variable in place of Kurt's identifier. Given the Robert Johnson story it seems reasonable to filter out any musicians born before the 20th century.

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX ont: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?name ?birth ?death WHERE {
  ?m a ont:MusicalArtist ;
     foaf:name ?name ;
     ont:birthDate ?birth ;
     ont:deathDate ?death .
  FILTER (?birth > "1900-01-01"^^xsd:date)
}

Running that query produces 4099 results, which seemed small enough to handle in a spreadsheet. Had it been a few more I'd probably have opted for the JSON representation of the results and done the processing with a little script. Had the querying been more complex (likely to cause timeouts on dbPedia) I'd probably have had to do some CONSTRUCT queries to extract the chunks of dbPedia of interest in RDF and put those in a local store, running queries against that. But it wasn't and it wasn't, so I ran the query directly, choosing the XML+XSLT stylesheet option to give me results in HTML. These I simply copied from the browser and pasted into a LibreOffice spreadsheet.
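For what it's worth, the "little script" route might look something like the sketch below. It isn't what I ran; it just flattens a document in the standard SPARQL JSON results layout into plain rows. The function name and the sample document are made up for illustration (the sample is hand-written, not real endpoint output).

```python
import json


def rows_from_sparql_json(text):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # A variable may be unbound in a given row, so fall back to None.
        rows.append({v: binding.get(v, {}).get("value") for v in variables})
    return rows


# Hand-written sample in the standard layout (an assumption for illustration,
# not output captured from dbPedia).
SAMPLE = """
{
  "head": {"vars": ["name", "birth", "death"]},
  "results": {"bindings": [
    {"name":  {"type": "literal", "xml:lang": "en", "value": "Kurt Cobain"},
     "birth": {"type": "literal",
               "datatype": "http://www.w3.org/2001/XMLSchema#date",
               "value": "1967-02-20"},
     "death": {"type": "literal",
               "datatype": "http://www.w3.org/2001/XMLSchema#date",
               "value": "1994-04-05"}}
  ]}
}
"""

if __name__ == "__main__":
    for row in rows_from_sparql_json(SAMPLE):
        print(row["name"], row["birth"], row["death"])
```

Each row comes out as a dict keyed by variable name, which is a convenient shape for any further number crunching.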

The spreadsheet automatically figured out the date format so I was able to get the musicians' ages with a trivial calculation. Sorting on this column revealed that the first 33 entries were duff data, mostly in an invalid format. Neither Wikipedia nor dbPedia is perfect. But 4066 values, even allowing for a few errors along the way, should be a big enough sample size to test the theory.
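The same age calculation and tally can be sketched in a few lines of Python. This is an illustration rather than the spreadsheet formula I used: it assumes each result row is a dict with "birth" and "death" keys holding ISO dates, the function names are invented, and the malformed row in the sample is made up to stand in for the duff data.

```python
from collections import Counter
from datetime import date


def age_at_death(birth, death):
    """Whole years lived, given ISO dates like '1967-02-20'; None if unusable."""
    try:
        b = date.fromisoformat(birth)
        d = date.fromisoformat(death)
    except (TypeError, ValueError):
        return None  # missing or invalid date, i.e. duff data
    if d < b:
        return None  # death before birth is also duff data
    # Subtract one year if the final birthday had not been reached.
    return d.year - b.year - ((d.month, d.day) < (b.month, b.day))


def deaths_by_age(rows):
    """Count how many rows fall at each age, skipping unusable rows."""
    ages = (age_at_death(r["birth"], r["death"]) for r in rows)
    return Counter(a for a in ages if a is not None)


if __name__ == "__main__":
    sample = [
        {"birth": "1967-02-20", "death": "1994-04-05"},  # Kurt Cobain
        {"birth": "0000-00-00", "death": "1994-04-05"},  # made-up bad row
    ]
    for age, count in sorted(deaths_by_age(sample).items()):
        print(age, count)
```

Comparing (month, day) pairs rather than dividing a day count by 365.25 avoids off-by-one ages around birthdays.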

You can see here another problem with the data - Kurt has two entries. I guess something like Google Refine could be used to tidy this up, but I went with the assumption that such problems would be reasonably evenly distributed. PS. A DISTINCT qualifier in the SELECT clause of the query would be an improvement here, and the ?name bit would be better dropped (it's not needed and introduces duplicates). I used the SNORQL endpoint of dbPedia.
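A sketch of that revised query, wrapped in a little Python that builds a request URL for it. Some assumptions to flag: selecting ?m alongside the dates is my own addition (so two musicians sharing both dates aren't collapsed into one row), the endpoint address is taken to be http://dbpedia.org/sparql, and the format parameter is a convention of the server software rather than part of the SPARQL protocol. A musician with more than one recorded birth date would still show up more than once.

```python
from urllib.parse import urlencode

# Assumed address of dbPedia's public SPARQL endpoint.
ENDPOINT = "http://dbpedia.org/sparql"

# Revised query: DISTINCT added, ?name dropped, ?m selected instead.
QUERY = """
PREFIX ont: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT DISTINCT ?m ?birth ?death WHERE {
  ?m a ont:MusicalArtist ;
     ont:birthDate ?birth ;
     ont:deathDate ?death .
  FILTER (?birth > "1900-01-01"^^xsd:date)
}
"""


def query_url(endpoint=ENDPOINT, query=QUERY):
    """Build a GET URL carrying the query in the standard 'query' parameter."""
    params = {
        "query": query,
        # Assumption: the server honours a 'format' parameter for JSON results.
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(query_url())
```

The URL can be pasted into a browser or fetched with any HTTP client; nothing here touches the network by itself.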

Here's an online Google Spreadsheet derived from my LibreOffice original.

So, results. I'll leave the statistical significance measuring to someone else, but to my eyes at least there doesn't seem to be a spike at 27, with only 40 deaths (there's a bigger version of the chart here). If anything, there may be a spike at the top value, 95 deaths at age 74. There may well be a 27 Club of accursed musicians, but the 74 Club is more popular. I don't have the figures for normal humans, but the BMJ found that "musicians in their 20s and 30s were two to three times more likely to die prematurely than the general UK population".

Keith Richards is 68.

Comments to G+ please.


danja
2012-03-07T16:21:58+01:00
linkeddata 27club sparql rdf data linked dbpedia journalism

Schema/Vocab Mapping toolkit

Olaf Hartig has pointed me to the R2R Framework:

[[

The R2R Framework enables Linked Data applications which discover data on the Web, that is represented using unknown terms, to search the Web for mappings and apply the discovered mappings to translate Web data to the application's target vocabulary. The R2R Framework is aimed to be used by Linked Data publishers, vocabulary maintainers and Linked Data application developers. It support them by:

1. providing the R2R Mapping Language for publishing fine-grained term mappings on the Web

2. defining best-practices on how mappings can be discovered by Linked Data applications

3. providing an open-source implementation of the R2R Mapping Engine.

]]


danja
2011-07-22T09:47:48+01:00
schema linkeddata r2r lod rdf vocab mapping

Linked Data One-Liner

A lot of information is merely On the Web when it would be more useful In the Web...


danja
2011-04-04T03:44:39+01:00
linkeddata rdf

Once more unto the breach (again)

For the first time in ages I've had a couple of days to sit down and look at code. A lot of it was stuff I hadn't finished, dating back a few years. The typical pattern was either getting distracted from the original aims and playing with the fun stuff, or aiming to do so much that I never really got past square one. So this time around I've changed my approach and decided to keep the fun stuff (playing with Agents in Scala) separate from the main app work.

The main app in mind here is the Semantic Web in a Box idea which I'm back to thinking about in a more minimal form, informed a lot by what Rob wrote on his blog - What people find hard about Linked Data - and the stuff in the Talis tutorial. Basically what I'm after is a very easy-to-use Linked Data editor/visualization tool, with support for some kind of pluggability (TBD). There are existing tools which can do this sort of stuff, but the key here is to keep things as simple as possible (and free and open source). Target users are total beginners and experienced folks that want to be able to knock simple stuff together quickly. There's really not a lot to this, and 'wait long by the river and implementation of your plans will float by' usually works, but no-one really seems to have got around to this thing.

It'll be a Java/Swing desktop app with the following features:

  • Internal triplestore(s)
  • RDF editor with various views and syntax validation
  • SPARQL editor and results viewer
  • HTTP client (for examining remote resources, crawling and publishing to remote stores/services)
  • HTTP server (for simulating live data)
  • HTTP proxy (for examining headers etc)
  • Basic HTML editor/viewer


What should also be possible is to run it headless, as a live service.

Probably more than half the people that read this are likely to have such parts living in their codebases - Java Swing components, Jena, ARQ, and Apache HTTP libs cover an awful lot. The tricky part is wiring them all up in a useful way, with a UI that doesn't confuse.

I've made a start on gathering together the bits, but I'm unlikely to get down to a good coding session for a while again, so what follows is really notes to self so I don't forget...

So, RDF editor.

Currently the main class is org.hyperdata.swing.rdftree.editor.RdfEditor

One view is a resource-centered thing, based on a JTree backed by a Jena Model. Like everything else here, it's unfinished and very buggy (notably there's something like an off-by-one error on which row expands). But this should give the general idea; the paths should expand indefinitely:

[Image: rdf tree table]

Right now it's only addressing the local model, but it should be reasonably straightforward to hook the HTTP client up to terminal node URIs to go and GET remote data (must check how Tabulator goes about that) and extend the drop-down paths.

Text views for Turtle and RDF/XML (with crude highlighting from JEditorPanes):

[Image: turtle editor]

[Image: xml editor]

I've only just started looking at a graph view (again!), separate from the stuff above - I just hacked at one of the JGraph demos, long way to go:

The launcher for that is org.hyperdata.swing.graph.danja.GraphEditor


[Image: graph view]

I've stuck the code over here:

source, wiki etc.


danja
2010-11-21T18:47:57+01:00
swib linkeddata semweb rdf

Linked Data and Hype

[in reply to John Sowa on the cg@conceptualgraphs.org list, unfortunately the mail didn't get through - something up with the server]

I reckon the activities around Linked Data are somewhat different to the typical "Next Big Thing". I'd suggest the NBT here if anything is the Semantic Web, which has suffered from industry hype, and as yet does not live up to the promises. However Linked Data is essentially the same idea as the Semantic Web, but with more emphasis on the "Web" side and less on the "Semantic".

The central idea of treating the Web conceptually as one big (graph-shaped) database works fine (and the LOD cloud [1] is a notable concrete manifestation), but as you note, most applications do require fast access to relevant data. Some of the more recent RDF stores/SPARQL engines do have performance comparable to traditional RDBs, but I don't think this is entirely relevant to the core paradigm. The tendency in the past has been for the creation of data silos, where each company or organization has their own discrete database. Where data is exposed to the Web it has been in the form of human-readable documents. This makes for a huge impedance mismatch for anyone wishing to use computers to make use of multiple data sources.

Where data is exposed to the Web as linked data, the material is available for direct recombination and reuse by other parties. When the appropriate standards are used (primarily URIs for identification, RDF for structure and HTTP for transfer) the notion of a database takes on a different form: a triplestore is a (fast) cache of a little chunk of the global Web of data.

Let's say electricity providers and water providers have their own databases. A company wishing to know where to lay fibre-optic cables would probably want to know where the existing (and planned) wiring/piping lies. Right now that would typically mean they'd need fairly in-depth knowledge of the database schemas and local conventions used by the utility companies. But if the data is available in a consistent form (i.e. RDF) then the work of aligning the source data and extracting the information becomes that much easier. The utilities may still have their own idiosyncratic ways of describing their systems, but then again if they happen to use some common vocabularies (e.g. for geo-location) considerably less expert knowledge of the individual systems is needed to get started. The fibre-optics company could run selective queries (or run a crawler) over the utilities' Web-exposed data, and trivially merge the results in their own, local, performant store.

The adoption of linked data has to some extent slipped under the radar of industry hype, a good example being http://data.gov.uk, which aims to take (non-personal) UK government data and expose it to the Web in a reusable form. The change in paradigm and increased potential for reuse is pretty apparent when you consider that a lot of the source data is held in Excel spreadsheets or buried in documents. This government-backed project has yielded a couple of surprises. On the one hand, gov departments have been willing to hand over their data and help out (the material is technically publicly available already, though for practical reasons that can be far from the case). On the other hand, developers have been fairly clamouring to get their hands on the data to build end-user applications.

(Incidentally, some of the data.gov.uk folks are working on the Linked Data API [2], which provides interfaces to triplestores that don't require any knowledge of RDF or SPARQL; that requirement has traditionally been something of a blocker.)


danja
2010-08-29T07:00:29+01:00
linkeddata semweb hype