Pezholio

A Beginner's Guide to SPARQLing Linked Data (Part 1)


Regular readers of this blog will have seen that, over the past 12 months or so, I’ve been banging on about linked data and SPARQL quite a bit, posting up example queries and the like, but with not much explanation about why I’m doing what I’m doing.

Thanks to the good folks at Talis and their offer of a free linked data store for spending data, I’ve also got a nice little store of my own, and I thought it was high time I went back to basics and passed on what I’ve learnt to other people.

First, a bit of background

The web, as most of us know it, is a network of documents, with each document having an individual URI to show its location on the web. With linked data, the web becomes a network of things, with each thing having an individual URI to tell us more about it.

When talking about linked data, we talk about URIs rather than URLs: URLs refer to the location of a file, whereas URIs are identifiers for things, and the thing behind a URI could be returned in any format.

An example of a URI is http://spending.lichfielddc.gov.uk/costcentre/waste-shared-service - when you access this via a web browser, the web browser asks for HTML and a web page giving information about this particular cost centre is returned. If you request this in any other way (say via the command line) and ask for RDF, you will get an RDF representation of information about this cost centre (web browsers filter out some of the XML, so it might be worth right clicking and selecting “view source” to see everything).
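To make the "ask for RDF" step a bit more concrete, here is a minimal Python sketch of how a client might ask for RDF rather than HTML. It assumes the server picks the format from an Accept header of application/rdf+xml, which is the usual way of doing this but isn't spelled out above. The function name build_rdf_request is mine, and the request is only built here, not sent.

```python
from urllib.request import Request

COST_CENTRE = "http://spending.lichfielddc.gov.uk/costcentre/waste-shared-service"

def build_rdf_request(uri, media_type="application/rdf+xml"):
    """Build (but do not send) a GET request asking for an RDF representation."""
    # The Accept header is how a client says which format it would like back.
    # Assumption: the server honours this header for the URI in question.
    return Request(uri, headers={"Accept": media_type})

request = build_rdf_request(COST_CENTRE)
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would actually send it; a browser does much the same thing, but asks for HTML instead.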

As well as identifying things, URIs also represent categories of things (called classes, or types). The previous example is therefore an expenditure category, represented by the URI http://reference.data.gov.uk/def/payment#expenditureCategory. To get all the types being used in our spending dataset, we can run the following SPARQL query:

SELECT DISTINCT ?o WHERE {?s a ?o}

See the results of this query

SPARQL queries themselves

SPARQL stands for SPARQL Protocol and RDF Query Language and is a similar sort of idea to SQL (which is used in most databases both on and off the web). The main difference is that while SQL generally allows us to read from and write to a database (and is therefore not particularly safe to open up to everyone), the SPARQL queries we’ll be writing here only read data, so they are non-destructive and well suited to public access. SPARQL is, of course, still intended for developers and is just as powerful as SQL for querying datasets.

RDF datastores have a SPARQL endpoint, which is basically a fancy name for a place where you can make SPARQL queries. Our SPARQL endpoint is located at http://api.talis.com/stores/lichfielddc-gov-uk/services/sparql and we make SPARQL queries by sending a GET request to this URI in the format http://api.talis.com/stores/lichfielddc-gov-uk/services/sparql?query={SPARQL query goes here} (with the query itself URL-encoded), or by entering our query on a web-based form that posts to our endpoint. The form we will be using is located here.
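As a rough illustration of the GET format described above, the Python sketch below builds the request URL for a query. The endpoint address is the one given in this post; the function name build_query_url is my own, and nothing is actually sent, so this only shows how the query text ends up encoded in the URL.

```python
from urllib.parse import urlencode

ENDPOINT = "http://api.talis.com/stores/lichfielddc-gov-uk/services/sparql"

def build_query_url(sparql, endpoint=ENDPOINT):
    """Return the GET URL for a SPARQL query, with the query text percent-encoded."""
    # urlencode escapes spaces, '?', '{' and '}', which cannot go into a URL as-is.
    return endpoint + "?" + urlencode({"query": sparql})

url = build_query_url("SELECT DISTINCT ?o WHERE {?s a ?o}")
print(url)
```

The web-based form does the same job for you, which is why you can type the query in as plain text there.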

The first thing we do when writing a SPARQL query is define our prefixes. If you look at a snapshot of the data we’ll be querying, you’ll see that most of the XML tags are written in the format <prefix:type>. (RDF can be written in various formats, but I’m using XML, which is not necessarily the best format, but that’s a discussion for another day!) If you look at the top of the document (you might need to view source in your web browser), you’ll see that all the prefixes are defined there as XML namespaces. For example, xmlns:payment="http://reference.data.gov.uk/def/payment#" means that the prefix “payment” is shorthand for “http://reference.data.gov.uk/def/payment#”, so “payment:reference” actually refers to a URI, in this case “http://reference.data.gov.uk/def/payment#reference”. We can do this in a similar way in our SPARQL, wrapping the URI in angle brackets, like so:

PREFIX payment: <http://reference.data.gov.uk/def/payment#>

There may be other prefixes too, and we declare these one per line, for example:

PREFIX payment: <http://reference.data.gov.uk/def/payment#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
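If it helps to see the shorthand spelled out, here is a small Python sketch of what a prefix declaration amounts to: swapping the bit before the colon for the full URI. The dictionary and the expand function are purely illustrative (they aren't part of any SPARQL library), and a real SPARQL engine does this for you.

```python
# Prefixes taken from the declarations above.
PREFIXES = {
    "payment": "http://reference.data.gov.uk/def/payment#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}

def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a name such as 'payment:reference' into its full URI."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix}")
    return prefixes[prefix] + local

print(expand("payment:reference"))
```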

The next thing to do is choose what data we want to return (in a similar way to an SQL SELECT query):

SELECT ?Payment WHERE {

Here we’re asking for one part of the dataset, which will be the URI that represents a particular payment. At this point, it doesn’t matter what we call it (as long as we use exactly the same name, capitals included, everywhere else in the query), as we need to define it in the next bit of our query:

?Payment a payment:Payment .

Here we’re saying “You know that thing I asked for in the SELECT part of the statement? That needs to be a http://reference.data.gov.uk/def/payment#Payment”.

We could stop here, but we’ll end up getting all the data back, which would take a long time, and probably not be very useful, so let’s filter this by only asking for payments made to a particular supplier:

?Payment payment:payee <http://spending.lichfielddc.gov.uk/supplier/burntwood-road-sweepers-ltd>

Let’s put it all together:

PREFIX payment: <http://reference.data.gov.uk/def/payment#>
SELECT ?Payment WHERE {
  ?Payment a payment:Payment .
  ?Payment payment:payee <http://spending.lichfielddc.gov.uk/supplier/burntwood-road-sweepers-ltd>
}

See the results of the query.

The query returns a list of URIs. If you copy and paste one of these into your browser, you’ll see a web page about that payment. The list isn’t very useful in itself, so let’s ask for a bit more data.

We’ll first modify the prefixes:

PREFIX payment: <http://reference.data.gov.uk/def/payment#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

Then the rest of the query:

SELECT ?Payment ?label ?date {
  ?Payment a payment:Payment .
  ?Payment payment:payee <http://spending.lichfielddc.gov.uk/supplier/burntwood-road-sweepers-ltd> .
  ?Payment rdfs:label ?label .
  ?Payment payment:date ?date
}

See the results of this query.

Hopefully you’ll be able to see what’s going on here: we’re now asking for the label of the payment (rdfs:label) and also the date (payment:date). In this dataset, the date is returned as a URI (e.g. http://reference.data.gov.uk/id/day/2010-04-26), but we can also return the date’s label in text form by adding the following:

SELECT ?Payment ?label ?date ?datelabel {

…snip

?date rdfs:label ?datelabel .

Here we’re saying “You know that date I asked for? I also want to see its label, and I want to call it datelabel”. The query now looks like this:

PREFIX payment: <http://reference.data.gov.uk/def/payment#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?Payment ?label ?date ?datelabel {
  ?Payment a payment:Payment .
  ?Payment payment:payee <http://spending.lichfielddc.gov.uk/supplier/burntwood-road-sweepers-ltd> .
  ?Payment rdfs:label ?label .
  ?Payment payment:date ?date .
  ?date rdfs:label ?datelabel
}

See the results of this query.

This is all well and good, but we don’t see any actual figures yet, which isn’t very useful. Going back to the snapshot of data I showed earlier, you can see that each payment has one or more expenditureLines, expressed as:

<payment:expenditureLine rdf:resource="http://spending.lichfielddc.gov.uk/spend/9178363"/>

If you look below each Payment, you’ll see the expenditureLine(s) for the payment, e.g.:

<payment:ExpenditureLine rdf:about="http://spending.lichfielddc.gov.uk/spend/9178363">
  <rdfs:label>Payment number 9178363</rdfs:label>
  <payment:expenditureCategory rdf:resource="http://spending.lichfielddc.gov.uk/costcentre/waste-shared-service"/>
  <payment:expenditureCategory rdf:resource="http://spending.lichfielddc.gov.uk/type/supplies-services"/>
  <payment:expenditureCategory rdf:resource="http://spending.lichfielddc.gov.uk/subjective/compost-disposal-costs"/>
  <payment:payment rdf:resource="http://spending.lichfielddc.gov.uk/invoice/16148"/>
  <payment:netAmount rdf:datatype="http://www.w3.org/2001/XMLSchema#decimal">1855.88</payment:netAmount>
  <qb:dataSet rdf:resource="http://spending.lichfielddc.gov.uk/dataset.rdf#2010-12-v1"/>
</payment:ExpenditureLine>

We can get this information in a similar way to how we got the date labels, so if we want to get a net amount for each payment we can add:

SELECT ?Payment ?line ?label ?date ?datelabel ?amount {

…snip

?line payment:payment ?Payment . ?line payment:netAmount ?amount

Here we’re now asking for each line that has a payment that is equal to “?Payment” (i.e. the Payments we’ve originally requested) and asking for their net amounts. The query now looks like this:

PREFIX payment: <http://reference.data.gov.uk/def/payment#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?Payment ?line ?label ?date ?datelabel ?amount {
  ?Payment a payment:Payment .
  ?Payment payment:payee <http://spending.lichfielddc.gov.uk/supplier/burntwood-road-sweepers-ltd> .
  ?Payment rdfs:label ?label .
  ?Payment payment:date ?date .
  ?date rdfs:label ?datelabel .
  ?line payment:payment ?Payment .
  ?line payment:netAmount ?amount
}

See the results of this query.
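If you’re wondering how the store works out which lines belong to which payments, the toy Python sketch below might help. It is emphatically not how a real SPARQL engine is built: the triples are made-up placeholders rather than data from the store, the function names match and solve are my own, and I’ve written rdf:type in place of “a” (which is just SPARQL shorthand for it). The point is simply that a variable used in more than one pattern has to take the same value in all of them.

```python
# Made-up triples in (subject, predicate, object) form. The identifiers are
# shortened placeholders, not real URIs from the spending dataset.
TRIPLES = [
    ("invoice/1", "rdf:type", "payment:Payment"),
    ("invoice/1", "payment:payee", "supplier/a"),
    ("invoice/2", "rdf:type", "payment:Payment"),
    ("invoice/2", "payment:payee", "supplier/b"),
    ("spend/10", "payment:payment", "invoice/1"),
    ("spend/10", "payment:netAmount", "100.00"),
    ("spend/11", "payment:payment", "invoice/1"),
    ("spend/11", "payment:netAmount", "25.50"),
]

def match(pattern, triple, binding):
    """Try to extend `binding` so that `pattern` fits `triple`; None on failure."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable keeps the first value it was given; a clash means no match.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new

def solve(patterns, triples=TRIPLES):
    """Return every set of variable bindings that satisfies all the patterns."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                candidate = match(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return bindings

results = solve([
    ("?Payment", "rdf:type", "payment:Payment"),
    ("?Payment", "payment:payee", "supplier/a"),
    ("?line", "payment:payment", "?Payment"),
    ("?line", "payment:netAmount", "?amount"),
])
for row in results:
    print(row["?Payment"], row["?line"], row["?amount"])
```

With this toy data, only the payment made to supplier/a survives the second pattern, and each of its two lines then comes back with its own amount.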

Which seems like a pretty neat place to leave it for now. Feel free to have a play with the stuff I’ve gone through so far, and ask any questions in the comments. I don’t claim to have all the answers, and I may have made some incorrect assumptions, so please feel free to put me right too!

Stay tuned for part 2, where I’ll cover filters and other cool things.
