March 28, 2003
amk @ amk.ca
The Semantic Web has been a W3C project since around 1999.
The existing Web of HTML documents is good for humans:
The Semantic Web will augment the existing human-readable Web with structured data that's easy for software to process.
The current architecture for the Semantic Web is split into three layers:
From lowest to highest:
The Resource Description Framework (RDF) is a generic format for metadata.
What's metadata? It's information about other data:
RDF is a specification that defines a model for representing the world, and a syntax for serializing and exchanging that model.
I have a bunch of book reviews, and I want people to be able to find them and use them in other applications.
The simple approach: a comma-separated file of ISBNs and URLs:
How do we incorporate title, author, and all that? Add more columns...
What if there are multiple authors?
Let's try a more self-describing approach:
ISBN: 1-930110-11-1
URL: http://example.com/rev1
Author: DuCharme, Bob
Author: Author2, A.
Title: XSLT Quickly
Publisher: Manning
Pages: 450pp
What if the author of a book has a web site?
ISBN: 1-930110-11-1
Author: DuCharme, Bob
Author-URL: http://www.snee.com/bob/
Author2: Author2, A.
Author2-URL: http://a2.example.com/
...
This gets the model slightly wrong: the web site is a property of the author, not the book.
So let's track reviews and authors in separate files.
ISBN: 1-930110-11-1
Author-ID: 0042
...

Author-ID: 0042
Name: DuCharme, Bob
URL: http://www.snee.com/bob/

Author-ID: 0043
Name: Undset, Sigrid
...
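To see what the two-file layout costs an application, here is a minimal sketch of the join it would have to perform. It assumes the files have already been parsed into Python dictionaries; the dictionary layout and the helper name authors_of are hypothetical, and only the field names and values come from the records above.

```python
# Records as they might look after parsing the two files.
# The dictionary layout is an assumption made for this sketch.
reviews = [
    {"ISBN": "1-930110-11-1", "Author-ID": ["0042"]},
]
authors = {
    "0042": {"Name": "DuCharme, Bob", "URL": "http://www.snee.com/bob/"},
    "0043": {"Name": "Undset, Sigrid"},
}


def authors_of(review):
    """Look up the author record for each Author-ID listed on a review."""
    return [authors[author_id] for author_id in review["Author-ID"]]


for review in reviews:
    for author in authors_of(review):
        print(review["ISBN"], author["Name"], author.get("URL", "(no URL)"))
```

Every new kind of record (publishers, web sites) would need another file and another hand-written join like this one.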
What if we wanted to store information about publishers, or author web sites?
In the RDF view of the world, everything is a graph.
Resources are identified by URIs.
How are properties identified? They could be just names or serial numbers, but that wouldn't be very scalable.
Instead, properties have URIs just like resources.
Definition: RDF statements are always (subject, property, object) triples.
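As a rough illustration, the book review data can be written as (subject, property, object) tuples in plain Python 3. This is only a sketch: it uses bare strings and does not distinguish URI references from literal values the way a real RDF library would. The URIs are the ones used elsewhere in this talk.

```python
REV = "http://amk.ca/xml/review/1.0#"
DC = "http://purl.org/dc/elements/1.1/"
FOAF = "http://xmlns.com/foaf/0.1/"

book = "urn:isbn:1930110111"
author = "http://example.com/author/0042"

# Each statement is one (subject, property, object) tuple.
triples = {
    ("http://example.com/rev1", REV + "subject", book),
    (book, DC + "title", "XSLT Quickly"),
    (book, DC + "creator", author),
    (author, FOAF + "homepage", "http://www.snee.com/bob/"),
}

# The homepage hangs off the author node, not the book node.
homepage_subjects = {s for s, p, o in triples if p == FOAF + "homepage"}
print(homepage_subjects)
```

Because properties are full URIs, two vocabularies can both define a "title" property without colliding.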
RDF Core defines an XML-based serialization for RDF.
<rdf:RDF xmlns:FOAF="http://xmlns.com/foaf/0.1/"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rev="http://amk.ca/xml/review/1.0#">

  <!-- Implies rdf:type property is rev:Review -->
  <rev:Review rdf:about="http://example.com/rev1">
    <rev:subject rdf:resource="urn:isbn:1930110111"/>
  </rev:Review>

  <rdf:Description rdf:about="http://example.com/author/0042">
    <FOAF:firstName>Bob</FOAF:firstName>
    <FOAF:homepage rdf:resource="http://www.snee.com/bob/"/>
    <FOAF:pastProject rdf:resource="urn:isbn:1930110111"/>
    <FOAF:surname>DuCharme</FOAF:surname>
  </rdf:Description>
</rdf:RDF>
An informal syntax that's easier to read and easier to scribble.
@prefix rev: <http://amk.ca/xml/review/1.0#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix FOAF: <http://xmlns.com/foaf/0.1/> .

<http://example.com/rev1>
    rev:subject [ = <urn:isbn:1930110111>;
                  dc:title "XSLT Quickly";
                  dc:creator <http://example.com/author/0042>;
                  dc:publisher "Manning" ] .

<http://example.com/author/0042>
    FOAF:firstName "Bob";
    FOAF:surname "DuCharme";
    FOAF:homepage <http://www.snee.com/bob/>;
    FOAF:pastProject <urn:isbn:1930110111> .
The most basic form of RDF software is simply an RDF parser. Parsers are available for most of the languages you might need:
Here's a Python example using rdflib 1.2 (www.rdflib.net).
First, create an InformationStore, which is a database of triples. The store is kept in a BerkeleyDB database, so it's persistent.
from rdflib.InformationStore import InformationStore

# Create an InformationStore to hold the RDF data
store = InformationStore()
store.open('/tmp/temporary-store')
You can add triples to InformationStores:
from rdflib.URIRef import URIRef
from rdflib.Literal import Literal

DC_TITLE = URIRef('http://purl.org/dc/elements/1.1/title')
REVIEW_SUBJECT = URIRef('http://amk.ca/xml/review/1.0#subject')

book_uri = URIRef('urn:isbn:0609602330')
store.add((URIRef('http://www.amk.ca/books/h/Isaacs_Storm.html'),
           REVIEW_SUBJECT,
           book_uri))
You can also remove a triple:
store.remove((URIRef('http://www.amk.ca/books/h/Isaacs_Storm.html'),
              REVIEW_SUBJECT,
              book_uri))
You can add the contents of a URL, parsing the data as RDF/XML:
When you load a file of data, a ContextStore object is returned. This can be used later to delete assertions that came from this file.
# Load a file
context = store.load('http://www.amk.ca/books/h/Bridal_Wreath.rdf')

# Remove the triples added from this file
store.remove_context(context.identifier)
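To make the idea of contexts concrete, here is a toy store in plain Python 3 that remembers which source each triple came from. It is a sketch of the concept only, not rdflib's implementation; the class name ContextTrackingStore, its methods, and the example triples are all hypothetical.

```python
class ContextTrackingStore:
    """Toy triple store that records the source (context) of each triple."""

    def __init__(self):
        self._by_context = {}  # context identifier -> set of triples

    def add(self, triple, context="default"):
        self._by_context.setdefault(context, set()).add(triple)

    def remove_context(self, context):
        """Drop every triple that was added under the given context."""
        self._by_context.pop(context, None)

    def __len__(self):
        # Count distinct triples across all contexts.
        return len(set().union(*self._by_context.values()))


store = ContextTrackingStore()
source = "http://www.amk.ca/books/h/Bridal_Wreath.rdf"

# One triple added directly, two "loaded" from a file (placeholder data).
store.add(("urn:example:book1", "dc:title", "Example title"))
store.add(("urn:example:book2", "dc:title", "Another title"), context=source)
store.add((source, "rev:subject", "urn:example:book2"), context=source)
count_before = len(store)

store.remove_context(source)
count_after = len(store)
print(count_before, count_after)
```

Tracking the source this way is what makes it possible to retract everything a single file asserted without touching the rest of the store.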
The most general query method is triples(), which takes a (subject, property, object) 3-tuple, returning an iterator over the matching triples.
For example, to list all things which have a dc:title property:
>>> DC_TITLE = URIRef('http://purl.org/dc/elements/1.1/title')
>>> for s,p,o in store.triples((None, DC_TITLE, None)):
...     print s,p,o
...
urn:isbn:0609602330 http://purl.org/dc/elements/1.1/title \
  Isaac's Storm
urn:isbn:1930110111 http://purl.org/dc/elements/1.1/title \
  XSLT Quickly
>>>
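The pattern-matching behaviour of triples() is simple enough to sketch without any library. The following Python 3 function is an assumption about how such matching can work, not rdflib's actual code: None in any position of the pattern acts as a wildcard. The function name match is hypothetical.

```python
def match(triples, pattern):
    """Yield triples matching a (subject, property, object) pattern.

    None in any position of the pattern matches anything.
    """
    for triple in triples:
        if all(want is None or want == have
               for want, have in zip(pattern, triple)):
            yield triple


DC_TITLE = "http://purl.org/dc/elements/1.1/title"
REVIEW_SUBJECT = "http://amk.ca/xml/review/1.0#subject"

data = [
    ("urn:isbn:0609602330", DC_TITLE, "Isaac's Storm"),
    ("urn:isbn:1930110111", DC_TITLE, "XSLT Quickly"),
    ("http://example.com/rev1", REVIEW_SUBJECT, "urn:isbn:1930110111"),
]

# All subjects that have a dc:title property.
titled = sorted(s for s, p, o in match(data, (None, DC_TITLE, None)))
print(titled)
```

A real store would use indexes instead of scanning every triple, but the interface is the same.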
Once we've parsed a pile of RDF about reviews, we can find all reviews of a particular book:
REVIEW_SUBJECT = URIRef('http://amk.ca/xml/review/1.0#subject')
t = (None, REVIEW_SUBJECT, URIRef('urn:isbn:1930110111'))
for subj,pred,obj in store.triples(t):
    print subj
For a fancier query, you have to implement the logic in Python. To list all reviews of books by an author with a particular name:
DC_CREATOR = URIRef('http://purl.org/dc/elements/1.1/creator')

# Loop over all the books whose dc:creator
# property is the right value
for book, pred, obj in store.triples((None, DC_CREATOR,
                                      Literal("DuCharme, Bob"))):
    ##print 'Book URI:', book
    # Find every review whose subject is this book
    t = (None, REVIEW_SUBJECT, book)
    for review, pred, obj in store.triples(t):
        print 'Review URL, reviewed book URI:', review, book
Versa (uche.ogbuji.net/tech/rdf/versa/) is a query language for searching RDF models.
The following query lists all reviews of books by an author with a particular name (probably not optimally):
((all() |- foaf:name -> eq("DuCharme, Bob", .)) <- dc:creator - true) <- rev:subject - .
RSS = Really Simple Syndication (or, for the RDF-based RSS 1.0 format generated here, RDF Site Summary).
def update_rss(entries):
    u = URIRef

    # Describe the channel
    ts = TripleStore()
    channel_uri = u(BASE_URL)
    ts.add(channel_uri, TYPE, u(RSS_URI+'channel'))
    ts.add(channel_uri, u(RSS_URI+'title'), Literal(WEBLOG_TITLE))
    ts.add(channel_uri, u(RSS_URI+'link'), Literal(BASE_URL))
    ts.add(channel_uri, u(RSS_URI+'description'),
           Literal(WEBLOG_DESCRIPTION))

    # Create an RDF Sequence (doing it by hand -- ick!)
    seq = BNode()
    ts.add(seq, TYPE, SEQ)
    ts.add(channel_uri, u(RSS_URI + 'items'), seq)

    # Add items to the Sequence
    for index, (e_date, entry) in enumerate(entries):
        url = u(entry.permalink)
        title = entry.title
        e_date = e_date.date()
        ts.add(url, TYPE, u(RSS_URI+'item'))
        ts.add(url, u(DC_URI + 'title'), Literal(title))
        ts.add(url, u(DC_URI + 'date'), Literal(e_date.isoformat()))

        # Add item URL to sequence: the property names are
        # ...#_1, ...#_2, ...#_3, ...
        ts.add(seq, u(RDFNS + '_' + str(index+1)), url)

    # Write the RSS file to a file
    ts.save("index.rss")
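The hand-built sequence is the fiddly part, so here it is in isolation. This Python 3 sketch uses plain tuples rather than a triple store; the helper name sequence_triples, the "_:seq" node label, and the entry URLs are hypothetical. The membership properties rdf:_1, rdf:_2, ... are numbered from 1.

```python
RDFNS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def sequence_triples(seq_node, items):
    """Return the (seq, rdf:_N, item) triples for an ordered list of items."""
    return [(seq_node, RDFNS + "_" + str(index), item)
            for index, item in enumerate(items, start=1)]


entries = ["http://example.com/entry/a", "http://example.com/entry/b"]
members = sequence_triples("_:seq", entries)
for triple in members:
    print(triple)
```

The order of the items lives entirely in the property names, which is why the numbering has to be generated carefully.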
With RDF, we can refer to resources and list a bunch of their properties. But how do we know when a resource is a review?
We need a way to say "Resource X is of the class Review." This can be expressed as a triple:
(Resource, rdf:type, class-URI)
Therefore, RDF classes are described in RDF. (Gets a bit head-bending at times...)
Here's an example in N3 that declares two classes.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rev: <http://amk.ca/xml/review/1.0#> .

# Declare a Review class
rev:Review     # class URI: http://amk.ca/xml/review/1.0#Review
    rdf:type rdfs:Class ;
    rdf:ID "Review" ;
    rdfs:comment """Reviews are resources that express an opinion
    about some other resource.""" ;
    .

# Declare a subclass of Review.
rev:ComparativeReview a rdfs:Class ;
    rdfs:subClassOf rev:Review ;
    rdfs:comment """Comparative reviews examine multiple resources,
    comparing their relative merits and usually offering an opinion
    about which one is the best.""" ;
    .
To declare that a particular resource is a rev:Review, assert that the resource's rdf:type property is the class:
# Declare a resource
<http://example.com/rev1> rdf:type rev:Review .
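Once classes and subclasses are stated as triples, software can answer "is this resource a Review?" by following rdfs:subClassOf links. The following Python 3 sketch shows one way to do that over plain tuples, using prefixed names as shorthand for full URIs. The resource http://example.com/rev2 and the function names are hypothetical.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

triples = {
    ("rev:ComparativeReview", SUBCLASS_OF, "rev:Review"),
    ("http://example.com/rev1", RDF_TYPE, "rev:Review"),
    ("http://example.com/rev2", RDF_TYPE, "rev:ComparativeReview"),
}


def superclasses(cls):
    """Return cls plus every class reachable via rdfs:subClassOf."""
    seen, todo = set(), [cls]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(o for s, p, o in triples
                        if s == current and p == SUBCLASS_OF)
    return seen


def is_instance(resource, cls):
    """True if any asserted type of resource is cls or a subclass of it."""
    return any(cls in superclasses(o) for s, p, o in triples
               if s == resource and p == RDF_TYPE)


print(is_instance("http://example.com/rev2", "rev:Review"))
```

A comparative review is found when searching for reviews in general, even though nobody asserted that type directly.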
Once classes have been defined, you can also list their properties. The following fragment defines the rev:subject property:
rev:subject rdf:type rdf:Property;
    rdfs:label "Subject property" ;
    rdfs:domain rev:Review ;
    rdfs:range rdfs:Resource ;
    rdfs:comment "Value is the resource being reviewed." ;
    .
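In RDF Schema, rdfs:domain is a basis for inference rather than a constraint: anything that has a rev:subject property can be concluded to be a rev:Review. Here is a minimal Python 3 sketch of that one rule over plain tuples, with prefixed names standing in for full URIs. The function name infer_types_from_domain is hypothetical.

```python
triples = {
    ("rev:subject", "rdfs:domain", "rev:Review"),
    ("http://example.com/rev1", "rev:subject", "urn:isbn:1930110111"),
}


def infer_types_from_domain(triples):
    """Return the rdf:type triples implied by rdfs:domain declarations."""
    domains = [(s, o) for s, p, o in triples if p == "rdfs:domain"]
    return {(s, "rdf:type", cls)
            for s, p, o in triples
            for prop, cls in domains
            if p == prop}


inferred = infer_types_from_domain(triples)
print(inferred)
```

Note that rev1 was never declared to be a Review in this data; the type follows from the property it uses.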
With RDF Schema, we know:
We don't know:
OWL is a W3C language for defining this sort of relationship, currently still in the rough draft stage.
It's based on two existing research languages, the American DAML (DARPA Agent Markup Language) and the European OIL (Ontology Inference Layer).
Here's an OWL declaration of a class representing persons.
owl:Class is a subclass of the RDF Schema rdfs:Class, so software that's RDF Schema-aware but not OWL-aware can still work just fine.
@prefix gen: <http://genealogy.example.com/schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

gen:Person a owl:Class;
    rdf:ID "person" ;
    rdfs:comment "Resource representing a person." ;
    .
owl:TransitiveProperty is a subclass of owl:ObjectProperty, which is a subclass of rdf:Property.
gen:ancestor a owl:TransitiveProperty;
    rdfs:domain gen:Person;
    rdfs:range gen:Person;
    .
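Declaring gen:ancestor transitive means that if A has ancestor B and B has ancestor C, software may conclude that A has ancestor C. A minimal Python 3 sketch of that inference follows; the people named here are hypothetical, and the naive loop is meant to show the idea rather than to be efficient.

```python
def transitive_closure(pairs):
    """Return all (a, c) pairs implied by chaining the given (a, b) pairs."""
    closure = set(pairs)
    while True:
        new = {(a, d)
               for a, b in closure
               for c, d in closure
               if b == c} - closure
        if not new:
            return closure
        closure |= new


# (person, ancestor) pairs asserted with gen:ancestor
asserted = {("gen:alice", "gen:bob"), ("gen:bob", "gen:carol")}
ancestors = transitive_closure(asserted)
print(sorted(ancestors))
```

Only two statements were asserted; the third, linking gen:alice to gen:carol, is derived.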
DAML+OIL and OWL add two more pieces of the final puzzle, one simple and one complicated:
The simple part: OWL adds the ability to indicate when two classes or properties are identical.
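As an illustration of the simple part, suppose another site uses its own property for "this review is about that book". The sketch below assumes the equivalence is stated with owl:equivalentProperty and rewrites the foreign property to the local one before querying. The property ex:reviewOf, the second review URL, and the function name are hypothetical, and the sketch handles only direct, one-way declarations, not chains of equivalences.

```python
EQUIVALENT = "owl:equivalentProperty"

triples = [
    ("ex:reviewOf", EQUIVALENT, "rev:subject"),
    ("http://example.com/rev1", "rev:subject", "urn:isbn:1930110111"),
    ("http://example.org/other-review", "ex:reviewOf", "urn:isbn:1930110111"),
]


def canonical_properties(triples):
    """Rewrite properties to their declared equivalents (one step only)."""
    alias = {s: o for s, p, o in triples if p == EQUIVALENT}
    return [(s, alias.get(p, p), o) for s, p, o in triples if p != EQUIVALENT]


reviews_of_book = sorted(
    s for s, p, o in canonical_properties(triples)
    if p == "rev:subject" and o == "urn:isbn:1930110111")
print(reviews_of_book)
```

With the equivalence declared, a single query finds reviews published under either vocabulary.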
The complex part: OWL declarations provide additional information to let rule-checking and theorem-proving systems work with RDF data.
So how much of this stuff do you need to learn about and use?
But we don't need to aim for the stars. Simple things can be done without much effort, and can still be useful:
On the other hand, there are signs of life: the DAML crawler (www.daml.org/crawler/) found 18,500 pages containing RDF in May 2002, and 706,821 pages as of today.
These slides: www.amk.ca/talks/semweb-intro