The Semantic Web

This is a very old article. It has been imported from older blogging software, and the formatting, images, etc may have been lost. Some links may be broken. Some of the information may no longer be correct. Opinions expressed in this article may no longer be held.

One of my current interests is the semantic web: the push to move from publishing text on the Web to publishing structured data that computers can actually understand (insofar as a computer can truly “understand” anything). Publishing information in a form that computers can understand turns the Web into a huge mine of interconnected data, free to be queried by everyone.

As an example of what I mean, searching for the keyword “train” on Google brings up results related to:

trains, as a form of transport
the band Train
IT training courses
toy trains

In the semantic web, the search engine and my computer would inherently understand the difference between these concepts, so if I wanted to know about the new Train album, I wouldn’t get any results related to locomotives!
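To make the contrast concrete, here is a minimal Python sketch. The documents, the `example.org` concept identifiers and both search functions are invented for illustration; no real search engine is claimed to work this way.

```python
# Hypothetical mini-index: each document has free text plus a concept URI
# saying what the page is actually about. All values here are made up.
DOCS = [
    {"text": "Train times from London to Brighton",
     "concept": "http://example.org/concept/RailTransport"},
    {"text": "Train announce a new album",
     "concept": "http://example.org/concept/TrainBand"},
    {"text": "Train to become a network engineer",
     "concept": "http://example.org/concept/ITTraining"},
    {"text": "Wooden toy train sets",
     "concept": "http://example.org/concept/ToyTrain"},
]


def keyword_search(word):
    """Match on the text alone, as a plain keyword search would."""
    return [d["text"] for d in DOCS if word.lower() in d["text"].lower()]


def concept_search(concept):
    """Match on the declared concept, ignoring the wording of the text."""
    return [d["text"] for d in DOCS if d["concept"] == concept]


if __name__ == "__main__":
    print(keyword_search("train"))
    print(concept_search("http://example.org/concept/TrainBand"))
```

The keyword search returns all four documents, while the concept search returns only the one about the band.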

I’m particularly interested in ways of embedding semantic data in ordinary web pages, so that we have a single web that can be read by both humans and machines instead of a two-tier web. At the moment I’m working on three main projects in this field, and I’d like to take some time to write about them now.

Microformats

Microformats are an effort to take common information which is already routinely published on the web, such as contact details, event information and geographic locations, and make it machine-readable.

For example, the following HTML might be used to add a link from somewhere to this website:

<a href="http://tobyinkster.co.uk">Toby Inkster</a>

We can “Microformat it up” by adding a few simple class attributes and an extra span element around the outside:


<span class="vcard"><a class="fn url" href="http://tobyinkster.co.uk">Toby Inkster</a></span>

This enables Microformat-aware software (including some browser plugins) to know that that text represents a person, and that the person’s first name is “Toby” and last name is “Inkster”, and that the person’s website is at http://tobyinkster.co.uk. The software is then able to do useful stuff with that information. For example, a browser plugin might offer a one-click option to add the person to your address book.
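As a rough illustration of what such software does, the Python sketch below pulls a name and URL out of hCard-style markup using only the standard library. The sample markup is an assumption based on the hCard class names `vcard`, `fn` and `url`. The `HCardSketch` class and `parse_hcard` function are hypothetical names, and the name-splitting step is only a crude stand-in for hCard's rule for deriving given and family names from a two-word `fn`. It is not a complete Microformats parser.

```python
from html.parser import HTMLParser

# Assumed example markup using the hCard class names "vcard", "fn" and "url".
SAMPLE = ('<span class="vcard">'
          '<a class="fn url" href="http://tobyinkster.co.uk">Toby Inkster</a>'
          '</span>')


class HCardSketch(HTMLParser):
    """Collects the first fn (formatted name) and url found after a vcard."""

    def __init__(self):
        super().__init__()
        self.seen_vcard = False
        self.fn_tag = None  # tag name of the element currently supplying fn
        self.card = {"fn": "", "url": None}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "vcard" in classes:
            self.seen_vcard = True
        if not self.seen_vcard:
            return
        if "fn" in classes and self.fn_tag is None and not self.card["fn"]:
            self.fn_tag = tag
        if "url" in classes and self.card["url"] is None:
            self.card["url"] = attrs.get("href")

    def handle_data(self, data):
        # Only text inside the fn element contributes to the name.
        if self.fn_tag is not None:
            self.card["fn"] += data

    def handle_endtag(self, tag):
        if tag == self.fn_tag:
            self.fn_tag = None


def parse_hcard(html):
    parser = HCardSketch()
    parser.feed(html)
    parser.close()
    card = parser.card
    # Simplification: split a two-word fn into given and family name.
    parts = card["fn"].split()
    if len(parts) == 2:
        card["given-name"], card["family-name"] = parts
    return card


if __name__ == "__main__":
    print(parse_hcard(SAMPLE))
```

A browser plugin could hand a dictionary like this to an address book; real parsers also handle nesting, multiple cards and many more properties.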

The focus of Microformats tends to be to encode fairly simple information (e.g. the concept of a “contact” is probably easier to encode into HTML than the concept of an “organisation structure”) which has a high return for your investment — that is, encoding information about contacts, events and geographic locations is very useful because you can then automatically invite people to events, find maps for places, check people’s schedules and locations to select the best place to get together for coffee, etc. Encoding the organisational structure for a government department probably doesn’t offer as many immediate advantages.

The two Microformats that I’m most actively trying to push forward are:

hCalendar 1.1: an update to the current Microformat for events, because I believe that the current specification is too vague and needs clarifying.
figure: a Microformat for describing images and diagrams found in documents. I’m working on this because I think it should be a very simple Microformat to get working, could be applied to a large number of documents out there, and should be very useful.

RDFa

RDFa is not a Microformat, but I thought I’d mention it before I got onto my next project. RDFa is another way of adding semantics to normal HTML documents. However, while Microformats deal with different concepts (contacts, events, locations, etc.) on a case-by-case basis, with each concept requiring a new Microformat to be developed, RDFa deals with the more general case. It provides a framework on which a document author can “hang” any metadata they want.
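To give a feel for the “hang any metadata” idea, here is a heavily simplified Python sketch that reads RDFa-style `about`, `property` and `content` attributes and emits subject, predicate, object triples. The sample markup, the `example.org` URL, the hard-coded `dc` prefix mapping and the names `RDFaSketch`, `expand` and `extract_triples` are all assumptions made for illustration. A conforming RDFa processor also handles prefix declarations, subject scoping, `rel`, `typeof`, datatypes and much more.

```python
from html.parser import HTMLParser

# Assumed prefix mapping; real RDFa reads prefix declarations from the page.
PREFIXES = {"dc": "http://purl.org/dc/terms/"}

# Hypothetical page fragment describing an article.
SAMPLE = """
<div about="http://example.org/post/1">
  <h1 property="dc:title">The Semantic Web</h1>
  <span property="dc:creator" content="Toby Inkster">Toby</span>
</div>
"""


def expand(curie):
    """Expand 'prefix:local' to a full URI when the prefix is known."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return curie


class RDFaSketch(HTMLParser):
    def __init__(self):
        super().__init__()
        self.subject = None   # most recent about= value (scoping is ignored)
        self.pending = None   # (tag, predicate) waiting for its text content
        self.text = []
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self.subject = attrs["about"]
        prop = attrs.get("property")
        if prop and self.subject:
            predicate = expand(prop)
            if "content" in attrs:
                # An explicit content attribute overrides the visible text.
                self.triples.append((self.subject, predicate, attrs["content"]))
            else:
                self.pending = (tag, predicate)
                self.text = []

    def handle_data(self, data):
        if self.pending:
            self.text.append(data)

    def handle_endtag(self, tag):
        if self.pending and tag == self.pending[0]:
            value = "".join(self.text).strip()
            self.triples.append((self.subject, self.pending[1], value))
            self.pending = None


def extract_triples(html):
    parser = RDFaSketch()
    parser.feed(html)
    parser.close()
    return parser.triples


if __name__ == "__main__":
    for triple in extract_triples(SAMPLE):
        print(triple)
```

The point of the sketch is that the vocabulary lives in the attribute values, so the same parsing code works for any property the author chooses, whereas each Microformat needs its own set of agreed class names.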

It’s an immensely powerful set of tools, but will probably take quite a while to gain momentum.

Although I’ve been watching it with interest, I’ve not contributed anything towards the RDFa specification so far, mostly because it seems to be progressing along quite nicely without me. What I have been looking at is how RDFa and Microformats can work together, which brings me to my next project…

Cognition

For about a month now, I’ve been working on Cognition, a tool designed to find and extract semantic data within web pages. Ultimately it aims to be a full-fledged graphical browser, but that’s a long way off; right now it just parses the data.

Cognition supports both RDFa and Microformats, plus a bunch of other ways of encoding information. It has been very interesting seeing how all these different semantic technologies can work together.

You can try it online to see what metadata Cognition can find on your web page!

demiblog

I have mentioned demiblog several times before. It’s the content management system that does all the behind-the-scenes work on this website. Since day one, it has had a focus on semantics, and my work on Cognition and Microformats has been helping it improve in this area over the last few weeks.

One major recent advance is that it’s moved from the old Dublin Core 1.1 vocabulary for describing articles, to the newer DC Terms vocabulary.
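For readers unfamiliar with the two vocabularies: the fifteen Dublin Core 1.1 elements live in the `http://purl.org/dc/elements/1.1/` namespace, and DC Terms defines properties with the same local names under `http://purl.org/dc/terms/`. The Python sketch below shows what such a migration amounts to at the level of property URIs. The function name `to_dc_terms` is hypothetical, and this is not a description of how demiblog itself implements the change.

```python
DC_ELEMENTS = "http://purl.org/dc/elements/1.1/"
DC_TERMS = "http://purl.org/dc/terms/"


def to_dc_terms(predicate):
    """Rewrite a Dublin Core 1.1 element URI to the same-named DC Terms URI.

    URIs outside the DC 1.1 namespace are returned unchanged.
    """
    if predicate.startswith(DC_ELEMENTS):
        return DC_TERMS + predicate[len(DC_ELEMENTS):]
    return predicate


if __name__ == "__main__":
    print(to_dc_terms("http://purl.org/dc/elements/1.1/title"))
    print(to_dc_terms("http://purl.org/dc/elements/1.1/creator"))
```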

Overall, development on demiblog hasn’t been as fast as I’d hoped — I’ve not been able to devote as much time to it as I’d have liked. It is creeping towards a 0.3.0 release though. I’ll get there eventually.

So, you see, I’m trying to work on improving the semantic web from three directions: a tool for publishing semantic data, standards for encoding semantic data, and a tool for using semantic data.