SPARQLing FOAFrolls

Sam Ruby's just been pulling feedlist info out of OPML and inserting it into PlanetPlanet's config file. While reading this I remembered a little item on my todo list and, being in (potentially structured) procrastination mode, decided there was no time like the present.

When I've played with aggregation in RDF with Python in the past, I've pulled out the feed subscription list programmatically (I've also been working around Redland, as does the Chumpalogica Planet code). But now that SPARQL's available pretty much everywhere (supported by Redland/Rasqal, naturally), there's no good reason to hardcode such stuff. The expression of feedlists in RDF/XML is very verbose (mostly because it carries loads more information than URI+title) but is easy to work with. Here's a (slightly trimmed) example from PlanetRDF's blogroll:



<foaf:Agent>
  <foaf:name>John Barstow</foaf:name>
  <foaf:weblog>
    <foaf:Document rdf:about="http://www.nzlinux.org.nz/blogs/">
      <dc:title>Visions of Aestia</dc:title>
      <rdfs:seeAlso>
        <rss:channel rdf:about="http://www.nzlinux.org.nz/blogs/wp-rdf.php?cat=9" />
      </rdfs:seeAlso>
    </foaf:Document>
  </foaf:weblog>
</foaf:Agent>


As a first pass on getting the necessary info out, ten minutes playing with Leigh's Twinkle (*snigger*) gave me this:



PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rss: <http://purl.org/rss/1.0/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?name ?title ?feed ?blog

WHERE {
   ?agent foaf:name ?name ;
          foaf:weblog ?blog .
   ?blog dc:title ?title ;
         rdfs:seeAlso ?feed .
   ?feed rdf:type rss:channel .
}



That's ok as far as it goes, but there's a good chance that automatically harvested data might be missing either the blog title or blogger name. So here's version two:

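# PREFIX declarations as in the first query

SELECT ?title ?feed ?blog

WHERE {
  ?agent foaf:weblog ?blog .
  ?blog rdfs:seeAlso ?feed .
  ?feed rdf:type rss:channel .
  # The first OPTIONAL that matches binds ?title ...
  OPTIONAL {
    ?blog dc:title ?title .
  }
  # ... so the agent's name is only used when no blog title was found.
  OPTIONAL {
    ?agent foaf:name ?title .
  }
}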


i.e. if the blog title is available, use that for the value of title, otherwise use the name of the agent (blogger).
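For comparison, here's a rough sketch of the kind of hardcoded extraction the query replaces, done with nothing but Python's standard library. It applies the same fallback (blog title if present, otherwise the blogger's name). The rdf:RDF wrapper, the function name feedlist and the second agent (example.org, no dc:title) are assumptions added for illustration, not part of PlanetRDF's blogroll. Note that it only understands this one particular RDF/XML layout, which is a good part of the argument for using SPARQL instead.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "rss": "http://purl.org/rss/1.0/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]

# The first agent is the example from the post; the wrapper element and
# the second agent are hypothetical, added to exercise the fallback.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:rss="http://purl.org/rss/1.0/"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <foaf:Agent>
    <foaf:name>John Barstow</foaf:name>
    <foaf:weblog>
      <foaf:Document rdf:about="http://www.nzlinux.org.nz/blogs/">
        <dc:title>Visions of Aestia</dc:title>
        <rdfs:seeAlso>
          <rss:channel rdf:about="http://www.nzlinux.org.nz/blogs/wp-rdf.php?cat=9" />
        </rdfs:seeAlso>
      </foaf:Document>
    </foaf:weblog>
  </foaf:Agent>
  <foaf:Agent>
    <foaf:name>Example Blogger</foaf:name>
    <foaf:weblog>
      <foaf:Document rdf:about="http://example.org/blog/">
        <rdfs:seeAlso>
          <rss:channel rdf:about="http://example.org/blog/index.rdf" />
        </rdfs:seeAlso>
      </foaf:Document>
    </foaf:weblog>
  </foaf:Agent>
</rdf:RDF>"""


def feedlist(rdf_xml):
    """Return a list of {title, feed, blog} dicts, one per feed found."""
    root = ET.fromstring(rdf_xml)
    rows = []
    for agent in root.findall("foaf:Agent", NS):
        name = agent.findtext("foaf:name", namespaces=NS)
        for blog in agent.findall("foaf:weblog/foaf:Document", NS):
            title = blog.findtext("dc:title", namespaces=NS)
            for channel in blog.findall("rdfs:seeAlso/rss:channel", NS):
                rows.append({
                    # Blog title wins; the agent's name is the fallback.
                    "title": title if title is not None else name,
                    "feed": channel.get(RDF_ABOUT),
                    "blog": blog.get(RDF_ABOUT),
                })
    return rows


for row in feedlist(SAMPLE):
    print(row["title"], row["feed"], row["blog"])
```

Any other serialisation of the same triples (rdf:Description elements, rdf:resource references and so on) would slip straight past this code, whereas the query would still match.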



The XML results of that query look like this:

<sparql xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
        xmlns="http://www.w3.org/2001/sw/DataAccess/rf1/result" >
  <head>
    <variable name="title"/>
    <variable name="feed"/>
    <variable name="blog"/>
  </head>
  <results>
    <result>
      <title>John Barstow</title>
      <feed uri="http://www.nzlinux.org.nz/blogs/wp-rdf.php?cat=9"/>
      <blog uri="http://www.nzlinux.org.nz/blogs/"/>
    </result>
    <result>
      <title>Plan B by Libby Miller</title>
      <feed uri="http://planb.nicecupoftea.org/index.rdf"/>
      <blog uri="http://planb.nicecupoftea.org/"/>
    </result>
...
  </results>
</sparql>


The PlanetPlanet configs are simple text files (config.ini). After switching toolkit (to emacs + xsltproc at the command line), 10 minutes later I had this:



<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:res="http://www.w3.org/2001/sw/DataAccess/rf1/result">

<xsl:output method="text" />

<xsl:template match="res:sparql/res:results">
   <xsl:for-each select="res:result">
[<xsl:value-of select="res:feed/@uri"/>]
name = <xsl:value-of select="res:title"/>
  <xsl:text>
   </xsl:text>
  </xsl:for-each>
</xsl:template>

</xsl:stylesheet>


which produces:



[http://www.nzlinux.org.nz/blogs/wp-rdf.php?cat=9]
name = John Barstow

[http://planb.nicecupoftea.org/index.rdf]
name = Plan B by Libby Miller
...


- which is the way the feedlist looks in the config files.
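Since PlanetPlanet is Python, the same transformation could presumably also be done without xsltproc. What follows is only a sketch using the standard library: the function name to_planet_config is made up for illustration, and the namespace and element layout are copied from the results document shown above. That layout follows a draft of the results format, so results in the final SPARQL XML format (which uses binding elements in a different namespace) would need different paths.

```python
import xml.etree.ElementTree as ET

# Namespace of the (draft-era) results format used in this post.
RES = {"res": "http://www.w3.org/2001/sw/DataAccess/rf1/result"}

# The two results shown in the post, without the elided remainder.
RESULTS_XML = """<sparql xmlns="http://www.w3.org/2001/sw/DataAccess/rf1/result">
  <head>
    <variable name="title"/>
    <variable name="feed"/>
    <variable name="blog"/>
  </head>
  <results>
    <result>
      <title>John Barstow</title>
      <feed uri="http://www.nzlinux.org.nz/blogs/wp-rdf.php?cat=9"/>
      <blog uri="http://www.nzlinux.org.nz/blogs/"/>
    </result>
    <result>
      <title>Plan B by Libby Miller</title>
      <feed uri="http://planb.nicecupoftea.org/index.rdf"/>
      <blog uri="http://planb.nicecupoftea.org/"/>
    </result>
  </results>
</sparql>"""


def to_planet_config(results_xml):
    """Turn SPARQL XML results into config.ini-style feed sections."""
    root = ET.fromstring(results_xml)
    sections = []
    for result in root.findall("res:results/res:result", RES):
        feed = result.find("res:feed", RES)
        if feed is None or not feed.get("uri"):
            continue  # no feed URI, nothing to subscribe to
        # ?title is optional in the query, so it may be missing here.
        title = result.findtext("res:title", default="", namespaces=RES)
        sections.append("[%s]\nname = %s\n" % (feed.get("uri"), title))
    # A blank line between sections, as in the XSLT output.
    return "\n".join(sections)


print(to_planet_config(RESULTS_XML))
```

Results without a feed URI are skipped here, which the stylesheet doesn't do; that's a design choice of the sketch rather than anything PlanetPlanet requires.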



Ok, procrastination over. I haven't time to look into how you might integrate this with PlanetPlanet, but that shouldn't be difficult: the Python RDFLib (as used in Sam and co's FeedValidator, no less) has had some SPARQL support for a while, though I'm not sure of its current status. But the interesting stuff only really starts after being able to read foafrolls - there's all kinds of other info available in RDF that could be useful to a Planet-style aggregator (especially if you did FOAF autodiscovery/XFN/Geo tag snagging/hCalendar GRDDL on the blogs).

But then the use of config.ini for the data would probably start looking clunky, so the logical thing to do would be to use an RDF store (maybe an RDF/XML file fronted by RDFLib). This would be a move away from the simple elegance of the current planet.py. But then pretty much for free you could also use the store for persistence of entries, and get faceted views of the person/entry data through SPARQL, plus (assuming RDFLib supports it) text search through SPARQL's regex support. A little RDF goes a long way...



Hmm, there's a little data point - it took me well over twice as long to write this post as it did to do the query and XSLT. The code was a lot more fun too ;-)


Danny Ayers
2006-05-19T15:01:06+02:00
