Sitemap notes

Today I added a sitemap to this blog. Some notes-to-self.

Not sure what inspired me to do this, but I have been wanting a complete list of blog post URIs for a while to play around with augmenting the data (e.g. pulling out the links contained within posts and grabbing more info about them).

Blog engine general setup

HTTP requests first go through Apache. If a file on the filesystem matches the request, that file is returned. If not, the request is transparently forwarded to an instance of Gradino running on port 8080, where it is dispatched through JAX-RS to the appropriate handler in the code (most of the code is in Scala, but using various Java libs). All the blog data is stored in a Jena TDB triplestore. When a request is made, a SPARQL query is run programmatically against the store, and the results are formatted as appropriate using a little crude templating. Results for the front page and feed are both cached as in-memory strings.
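The caching part of that setup can be sketched roughly as below. This is Python rather than the Scala the engine is written in, and the names (`RenderCache`, `render_front_page`) are made up for illustration. When the real engine throws its cached strings away isn't described above, so the `invalidate` step is purely an assumption.

```python
from typing import Callable, Dict


class RenderCache:
    """Keeps fully rendered pages as in-memory strings, keyed by name."""

    def __init__(self) -> None:
        self._pages: Dict[str, str] = {}

    def get(self, key: str, render: Callable[[], str]) -> str:
        # Only run the expensive query-and-template step on a cache miss.
        if key not in self._pages:
            self._pages[key] = render()
        return self._pages[key]

    def invalidate(self) -> None:
        # Assumption: a new post simply discards everything cached.
        self._pages.clear()


calls = []  # records how often the stand-in renderer actually runs


def render_front_page() -> str:
    # Hypothetical stand-in for "run the SPARQL query, fill in the template".
    calls.append("front")
    return "<html>front page</html>"


cache = RenderCache()
first = cache.get("front", render_front_page)
second = cache.get("front", render_front_page)  # served from memory
cache.invalidate()
third = cache.get("front", render_front_page)   # rendered again
```

The second `get` never touches the renderer, which is the whole point: the store only gets queried when there is nothing in memory to hand back.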

Adding sitemap generator

For the sitemap, my first pass set things up the same way the front page and feed are generated, just without a LIMIT on the SPARQL query. This wound up making Apache give a proxy error. I'm not sure exactly why (for some reason error messages didn't show), but it seemed reasonable to assume it was related to the quantity of results, maybe a silent timeout. I've got archives in the store going back years, and my current query (excluding everything with "comment" in the URI) produces just over 5,500 results.

So then I decided to modify things to generate a static file when a POST was received at a particular URL. I should have seen this coming, but my initial attempt at this also gave a proxy error. D'oh! Performance-wise it was effectively the same routine running in the same thread.

But I was able to get it working by making the sitemap generator class a Scala Actor. When the appropriate POST is received, the handler creates a new instance of the Actor and sends it a message, but then continues along the original thread, returning an "ok" message to the browser.
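The shape of that fix, sketched in Python with a plain thread standing in for the Scala Actor (so this is a swapped-in technique, not the actual code). The class and function names are hypothetical, the `sleep` fakes the slow query, and a list stands in for the static file.

```python
import threading
import time


class SitemapGenerator(threading.Thread):
    """Stand-in for the actor: does the slow work on its own thread."""

    def __init__(self, uris, out):
        super().__init__()
        self.uris = uris
        self.out = out  # a list standing in for the static file on disk

    def run(self):
        time.sleep(0.2)  # pretend the unlimited query takes a while
        for uri in self.uris:
            self.out.append("<url><loc>%s</loc></url>" % uri)


def handle_post(uris, out):
    """Hypothetical POST handler: start the generator, reply straight away."""
    worker = SitemapGenerator(uris, out)
    worker.start()
    return "ok", worker  # the reply does not wait for the worker


sitemap_lines = []
reply, worker = handle_post(["http://dannyayers.com/"], sitemap_lines)
worker.join()  # only here so the example can inspect the finished result
```

The handler's return no longer depends on how long generation takes, so the proxy in front of it gets its response well before any timeout could kick in.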

Along the way I evolved what I reckoned was most suitable to put in the sitemap file. The blog front page just uses the core sitemap terms, and this is hard-coded:

<url>
  <loc>http://dannyayers.com/</loc>
  <changefreq>daily</changefreq>
  <priority>0.9</priority>
</url>

Initially I had individual posts using the News sitemap terms, until I noticed just now that those are only meant for things that change a lot. So instead they just look like this:

<url>
  <loc>http://dannyayers.com/2003/05/12/bufo-bufo/</loc>
  <lastmod>1970-01-01T01:00:00Z</lastmod>
  <changefreq>monthly</changefreq>
</url>

I've left it as monthly in case I want to change any of the template of the individually rendered pages, but reindexing isn't really a priority once the content text has been looked at.
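For reference, generating entries of the two shapes shown above could look something like this. It's a Python sketch using the standard library, not the generator itself: `build_sitemap` is a made-up name, the "comment" filter is done in code here rather than in the SPARQL query, and the comment URI in the sample data is invented to show the filter working.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(front_page, posts):
    """posts is a list of (uri, lastmod) pairs, lastmod as a W3C datetime string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)

    # The front page entry is hard-coded, as in the real thing.
    front = ET.SubElement(urlset, "url")
    ET.SubElement(front, "loc").text = front_page
    ET.SubElement(front, "changefreq").text = "daily"
    ET.SubElement(front, "priority").text = "0.9"

    for uri, lastmod in posts:
        if "comment" in uri:
            continue  # leave comment URIs out of the sitemap
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = uri
        ET.SubElement(entry, "lastmod").text = lastmod
        ET.SubElement(entry, "changefreq").text = "monthly"

    return ET.tostring(urlset, encoding="unicode")


sitemap_xml = build_sitemap(
    "http://dannyayers.com/",
    [
        ("http://dannyayers.com/2003/05/12/bufo-bufo/", "1970-01-01T01:00:00Z"),
        # Hypothetical comment URI, only here to exercise the filter.
        ("http://dannyayers.com/2003/05/12/bufo-bufo/comment/1", "1970-01-01T01:00:00Z"),
    ],
)
```

Building the tree with a library rather than string concatenation means escaping is handled for free, which is the sort of little error that otherwise only shows up once a crawler reads the file.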

Next I guess I should look at Semantic Sitemaps.

I'm typing this as yet another version of the generator code is running. I've kept making little errors that only show up when I point Google at the sitemap file... But if you're reading this then Google is happy with the current version :)


danja
2011-07-24T20:59:40+01:00
seo sitemap catalog gradino rdf

Piano Piano

Where I'm staying at the moment I don't have much time to get on the computer, and net access is really lousy. But I've had a lot of chance to think about stuff that I want to do, and have realised that I can feed a few birds with one bean. The blog engine (this) I've been writing in Scala is approaching the basic level of functionality I wanted, so I'm looking again at a couple of old ideas.

The first is Semantic Web in a Box (new name needed!), the second an agent-based engine that will support scripting (I did a lightning talk about that at one of the SFSW meetups, must see if I can find the slides). Given that Scala actors are perfect for constructing the kind of agents I have in mind, as well as offering a nice way of doing the SemWeb in a Box stuff, I reckon I'll wrap it all together into one project. And the first application built with this setup can be a refactoring of my blog engine...

Many of the agents probably won't have all these features, but the stereotypical agent I want, a SemWebAgent, will have the following traits:

  • named with a URI
  • access from an HTTP server
  • access to an HTTP client
  • triplestore


+ some code that'll actually do something useful

Looking from outside, the things will look like regular Web-accessible resources, and can call/be called by external (RESTful) clients/services etc. Internally, if a particular named resource lies within the same VM then more direct messaging is possible. For scripting (when I get around to it), I've got Jython and Rhino (or equivalents) in mind. To support the pluggability of SemWeb in a Box, I'll go for OSGi, probably using Felix as the container.
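To make the local-versus-remote idea a bit more concrete, here is a toy sketch in Python (not Scala actors, and not code from the project). Everything in it is hypothetical: the `Registry` class, the `example.org` URIs, and the `http_post` callable that stands in for a real HTTP client.

```python
class SemWebAgent:
    """Toy agent: a URI for a name, its own triples, and an inbox."""

    def __init__(self, uri):
        self.uri = uri
        self.triples = set()  # stand-in for a real triplestore
        self.inbox = []

    def receive(self, message):
        self.inbox.append(message)


class Registry:
    """Knows which agents live in this process (the 'same VM' case)."""

    def __init__(self, http_post):
        self.local = {}
        self.http_post = http_post  # callable used for everything non-local

    def register(self, agent):
        self.local[agent.uri] = agent

    def send(self, target_uri, message):
        agent = self.local.get(target_uri)
        if agent is not None:
            agent.receive(message)  # direct, in-process delivery
            return "local"
        self.http_post(target_uri, message)  # otherwise go out over HTTP
        return "http"


sent_over_http = []  # records what would have gone out on the wire
registry = Registry(http_post=lambda uri, msg: sent_over_http.append((uri, msg)))
blog = SemWebAgent("http://example.org/agents/blog")
registry.register(blog)

how_local = registry.send("http://example.org/agents/blog", "ping")
how_remote = registry.send("http://example.org/agents/elsewhere", "ping")
```

The caller only ever deals in URIs; whether a message is handed over directly or sent as an HTTP request is decided by the lookup.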

I've started coding up the core actor stuff, which I will fill with unit tests as well - being new to Scala, I'll no doubt make a lot of mistakes. I'm also putting together some functional tests for the blog engine, which I'll refactor to use this system. I'm already using a tiny bit of Apache Clerezza (for JAX-RS handling of HTTP calls), and I believe there'll be quite a lot more I can cherry-pick.


danja
2010-10-10T10:46:05+01:00
box clerezza gradino semweb rdf

Per-Tag Feeds

I've just added a quick feature here so that if you go to a URI of the form /feed/tag/{TAG} it will produce an RSS 1.0 (RDF) feed for that tag. So hopefully /feed/tag/rdf will now be everything tagged "rdf".
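The matching involved is roughly the following, sketched in Python rather than the JAX-RS path template the engine presumably uses. The function names are made up, tolerating a trailing slash is my assumption, and the sample data is just the titles and tags of three posts from this blog.

```python
import re

# /feed/tag/{TAG}, optionally with a trailing slash (an assumption).
TAG_FEED = re.compile(r"^/feed/tag/([^/]+)/?$")


def tag_for_path(path):
    """Return the tag for a /feed/tag/{TAG} path, or None if it doesn't match."""
    match = TAG_FEED.match(path)
    return match.group(1) if match else None


def posts_for_tag(posts, tag):
    """posts is a list of (title, tags) pairs; keep the titles carrying the tag."""
    return [title for title, tags in posts if tag in tags]


posts = [
    ("Sitemap notes", ["seo", "sitemap", "catalog", "gradino", "rdf"]),
    ("Per-Tag Feeds", ["code", "gradino", "rdf", "tags"]),
    ("tags test", ["test", "gradino"]),
]

tag = tag_for_path("/feed/tag/rdf")
rdf_titles = posts_for_tag(posts, tag)
not_a_tag_feed = tag_for_path("/feed")
```

In the real engine the selection would be a SPARQL query against the store rather than a list comprehension, but the path-to-tag step is the same idea.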

PS. Silly me mistyped the above, so went back and started coding up item editing (as yet unimplemented)...then realised I'd already set things up so that if I post something with the same title on the same day it will already overwrite the previous entry (all the triples hanging off that URI). Heh.
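Why that works, as a Python sketch with a plain set standing in for the triplestore. The slug rule is only my guess from URIs of the form /2003/05/12/bufo-bufo/, the "Bufo bufo" title is assumed from that URI, and the predicate names are placeholders rather than the vocabulary the blog actually uses.

```python
import re


def post_uri(base, date, title):
    """Build base/YYYY/MM/DD/slug/ from an ISO date string and a title."""
    # Assumed slug rule: lowercase, runs of anything else become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    year, month, day = date[:10].split("-")
    return "%s/%s/%s/%s/%s/" % (base, year, month, day, slug)


def save_post(store, uri, triples):
    """store is a set of (s, p, o) tuples; replace everything hanging off uri."""
    stale = {t for t in store if t[0] == uri}
    store.difference_update(stale)  # drop the previous version of the post
    store.update(triples)           # then add the new one


store = set()
uri = post_uri("http://dannyayers.com", "2003-05-12", "Bufo bufo")

# Same title, same day: the second save lands on the same URI.
save_post(store, uri, [(uri, "title", "Bufo bufo"), (uri, "content", "frist draft")])
save_post(store, uri, [(uri, "title", "Bufo bufo"), (uri, "content", "first draft")])
```

Because the URI is derived from the date and title, reposting with a corrected body is effectively an edit, as long as the title doesn't change.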


danja
2010-09-27T15:23:45+01:00
code gradino rdf tags

tags test

I've added some fresh tag-handling code to this blog, and in the process broken its rendering of tags. I'm finding generating XML in Scala a bit confusing. This will hopefully locate the problem.


danja
2010-08-12T10:20:45+01:00
test gradino