Introduction to the Semantic Web and RDF

PyCon
March 28, 2003

A.M. Kuchling
www.amk.ca
amk @ amk.ca

Overview of the Semantic Web

The Semantic Web has been a W3C project since around 1999.

The existing Web of HTML documents is good for humans, but less so for programs:

Needs only a single program to access it: the browser.
Programs have a harder time with it; it's too messy.
Screen-scrapers can extract information from HTML, but they're hard to write.

The Semantic Web will augment the existing human-readable Web with structured data that's easy for software to process.

Layers of the Semantic Web

The current architecture for the Semantic Web is split into three layers:

From lowest to highest:

Resource Description Framework (RDF): lets you assert facts
- e.g. person X is named "Drew".
RDF Schema: lets you describe vocabularies and use them to describe things
- e.g. person X is a LivingPerson.
Web Ontology Language (OWL): lets you describe relationships between vocabularies
- e.g. persons in schema A are the same thing as users in schema B.

Overview of RDF

The Resource Description Framework (RDF) is a generic format for metadata.

What's metadata? It's information about other data:

For a web page: Author, timestamp, content-type.
For a photograph: Photographer, subject, timestamp, camera model, film used.
For an astronomical observation: date/time, coordinates, instrument, part of the instrument.

RDF is a specification that defines a model for representing the world, and a syntax for serializing and exchanging that model.

Motivating Example: Book Reviews

I have a bunch of book reviews, and I want people to be able to find them and use them in other applications.

Model:

Some web pages are reviews of something else: a book, a recording, another web page.
The item being reviewed has various properties: a title, an author, an ISBN (for books, at least).

The simple approach: a comma-separated file of ISBNs and URLs:

1-930110-11-1,http://example.com/rev1
0-471-21822-7,http://example.com/rev2

How do we incorporate title, author, and all that? Add more columns...

What if there are multiple authors?
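
A small sketch of where the comma-separated approach starts to strain. The extra columns, the placeholder title, and the ';' separator for authors are all assumptions made up for illustration; nothing in the original file format defines them.

```python
import csv
import io

# Hypothetical extended file: ISBN, review URL, title, authors.
# Several authors are squeezed into one column with ';' (an ad hoc choice).
DATA = '''1-930110-11-1,http://example.com/rev1,XSLT Quickly,"DuCharme, Bob;Author2, A."
0-471-21822-7,http://example.com/rev2,Title2,"Author3, C."
'''

def load_reviews(text):
    """Parse the CSV text into a list of dictionaries, one per row."""
    rows = []
    for isbn, url, title, authors in csv.reader(io.StringIO(text)):
        rows.append({
            'isbn': isbn,
            'url': url,
            'title': title,
            # Every consumer of the file has to know the ';' convention.
            'authors': authors.split(';'),
        })
    return rows

reviews = load_reviews(DATA)
print(reviews[0]['authors'])
```

The file itself says nothing about what each column means or how the author column is split, so every reader of the data has to hard-code those rules.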

Motivating Example: RFC-2822 Approach

Let's try a more self-describing approach:

ISBN: 1-930110-11-1
URL: http://example.com/rev1
Author: DuCharme, Bob
Author: Author2, A.
Title: XSLT Quickly
Publisher: Manning
Pages: 450pp

What if the author of a book has a web site?

ISBN: 1-930110-11-1
Author: DuCharme, Bob
Author-URL: http://www.snee.com/bob/
Author2: Author2, A.
Author2-URL: http://a2.example.com/
  ...

This gets the model slightly wrong: the web site is a property of the author, not the book.
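
The header-style record can be read with Python's standard email package, which is one possible way to parse RFC-2822-style text (an assumption here; the slides do not say how such files would be parsed). The sketch shows that repeated headers work, but that every value is a flat string attached to the book.

```python
from email import message_from_string

# The record from the first example, followed by a blank line.
RECORD = """\
ISBN: 1-930110-11-1
URL: http://example.com/rev1
Author: DuCharme, Bob
Author: Author2, A.
Title: XSLT Quickly
Publisher: Manning
Pages: 450pp

"""

record = message_from_string(RECORD)

# Repeated headers are preserved, so multiple authors are possible.
authors = record.get_all('Author')
print(authors)

# Each value is just a string belonging to the record as a whole:
# there is no place to attach a property to one particular author.
print(record['Title'])
```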

Example: RFC-2822: Second try

So let's track reviews and authors in separate files.

reviews.dat:

ISBN: 1-930110-11-1
Author-ID: 0042
...

authors.dat:

Author-ID: 0042
Name: DuCharme, Bob
URL: http://www.snee.com/bob/

Author-ID: 0043
Name: Undset, Sigrid
...

What if we wanted to store information about publishers, or author web sites?

An Example RDF Graph

In the RDF view of the world, everything is a graph.

[Example RDF graph]

RDF Graph: Resources

Resources are identified by URIs

e.g. http://example.com/person/0042, urn:isbn:1930110111
URI(dentifier)s are not necessarily URL(ocator)s

[Example RDF graph highlighting resources]

RDF Graph: Literals

A primitive string value.
The interpretation of the string is up to your application.

[Example RDF graph highlighting literals]

RDF Graph: Properties

An attribute or aspect of a resource.
A property value can be a literal or another resource.
Multiple values are allowed; no value at all is also legal.

[Example RDF graph highlighting property arcs]

RDF Graph: Property URIs

How are properties identified? They could be just names or serial numbers, but that wouldn't be very scalable.

Instead, properties have URIs just like resources.

Pick a base URI for your RDF model:
http://amk.ca/xml/review/1.0#
The base URI is assigned to a prefix, such as "rev".
Properties are then referenced as 'rev:subject', 'rev:topic', etc.
The RDF parser will concatenate the base URI (from the prefix) and the name:
- rev:subject → http://amk.ca/xml/review/1.0#subject
- Without the '#', you'd get .../1.0subject
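
The prefix expansion described above is plain string concatenation, which a few lines of Python can illustrate. The function name expand() and the PREFIXES table are hypothetical helpers for this sketch, not part of any RDF library.

```python
# Prefix table: the 'rev' and 'dc' base URIs are the ones used in these slides.
PREFIXES = {
    'rev': 'http://amk.ca/xml/review/1.0#',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

def expand(qname, prefixes=PREFIXES):
    """Expand 'prefix:name' into a full URI by concatenation."""
    prefix, name = qname.split(':', 1)
    return prefixes[prefix] + name

print(expand('rev:subject'))

# With the trailing '#' left off the base URI, the pieces run together.
print(expand('rev:subject', {'rev': 'http://amk.ca/xml/review/1.0'}))
```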

RDF statements and triples

Definition: RDF statements are always (subject, property, object) 3-tuples.

Subject	Property	Object
`http://example.com/rev1`	`rev:subject → http://amk.ca/xml/review/1.0#subject`	`urn:isbn:1930110111`
`urn:isbn:1930110111`	`dc:title → http://purl.org/dc/elements/1.1/title`	"XSLT Quickly"
`urn:isbn:1930110111`	`dc:creator → http://purl.org/dc/elements/1.1/creator`	`http://example.com/author/0042`
`http://example.com/author/0042`	`FOAF:surname → http://xmlns.com/foaf/0.1/surname`	"DuCharme"
`http://example.com/author/0042`	`FOAF:homepage → http://xmlns.com/foaf/0.1/homepage`	`http://www.snee.com/bob/`
`http://example.com/author/0042`	`FOAF:pastProject → http://xmlns.com/foaf/0.1/pastProject`	`urn:isbn:1930110111`
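
The statements in the table can be written down directly as Python tuples. This is only a sketch of the data model: it treats resources and literals alike as plain strings, and describe() is a hypothetical helper, not a library function.

```python
REV = 'http://amk.ca/xml/review/1.0#'
DC = 'http://purl.org/dc/elements/1.1/'
FOAF = 'http://xmlns.com/foaf/0.1/'

review = 'http://example.com/rev1'
book = 'urn:isbn:1930110111'
author = 'http://example.com/author/0042'

# Each statement is one (subject, property, object) tuple.
# Simplification: resources and literals are both plain strings here.
statements = {
    (review, REV + 'subject', book),
    (book, DC + 'title', 'XSLT Quickly'),
    (book, DC + 'creator', author),
    (author, FOAF + 'surname', 'DuCharme'),
    (author, FOAF + 'homepage', 'http://www.snee.com/bob/'),
    (author, FOAF + 'pastProject', book),
}

def describe(subject):
    """Return the sorted (property, object) pairs asserted about subject."""
    return sorted((p, o) for s, p, o in statements if s == subject)

for prop, value in describe(author):
    print(prop, value)
```

Because the book appears both as a subject and as an object, the set of tuples already forms a graph; no separate files or ID columns are needed.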

RDF syntaxes: RDF/XML

RDF Core defines an XML-based serialization for RDF.

<rdf:RDF 
    xmlns:FOAF="http://xmlns.com/foaf/0.1/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rev="http://amk.ca/xml/review/1.0#">

    <!-- Implies rdf:type property is rev:Review -->
    <rev:Review rdf:about="http://example.com/rev1">
        <rev:subject rdf:resource="urn:isbn:1930110111"/>
    </rev:Review>

    <rdf:Description rdf:about="http://example.com/author/0042">
        <FOAF:firstName>Bob</FOAF:firstName>
        <FOAF:homepage rdf:resource="http://www.snee.com/bob/"/>
        <FOAF:pastProject rdf:resource="urn:isbn:1930110111"/>
        <FOAF:surname>DuCharme</FOAF:surname>
    </rdf:Description>
</rdf:RDF>

RDF syntaxes: Notation-3 (or N3)

An informal syntax that's easier to read and easier to scribble.

@prefix rev: <http://amk.ca/xml/review/1.0#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix FOAF: <http://xmlns.com/foaf/0.1/> .

<http://example.com/rev1> rev:subject [
   = <urn:isbn:1930110111>;
   dc:title "XSLT Quickly";
   dc:creator <http://example.com/author/0042>;
   dc:publisher "Manning" ] .

<http://example.com/author/0042> 
    FOAF:firstName "Bob";
    FOAF:surname "DuCharme";
    FOAF:homepage <http://www.snee.com/bob/>;
    FOAF:pastProject <urn:isbn:1930110111> .

RDF's Sins and Virtues

Virtues:

RDF exists and is strictly specified.
Conceptually the graph model is pretty simple.
RDF has a number of implementations.
RDF is decentralized:
- Anyone can create a vocabulary.
- Anyone can publish data about other resources.

Sins:

RDF/XML is rather verbose, and tedious to read/write by hand.
Programming interfaces require you to know about triples, URIs, and all these low-level details.

Available RDF Software

The most basic form of RDF software is simply an RDF parser. Parsers are available for most of the languages you might need:

Python (rdflib, 4RDF, cwm)
Perl (RDF::Core)
C (libwww, Redland)
- Thanks to SWIG, Redland also has interfaces for most of the other scripting languages.
Java (Jena)
PHP, Ruby, Prolog all have RDF parsers.

Example code: Initializing an RDF database

Here's a Python example using rdflib 1.2 (www.rdflib.net).

First, create an InformationStore, which is a database of triples. The store is kept in a BerkeleyDB database, so it's persistent.

from rdflib.InformationStore import InformationStore

# Create an InformationStore to hold the RDF data
store = InformationStore()
store.open('/tmp/temporary-store')

Example code: Modifying the database

You can add triples to InformationStores:

from rdflib.URIRef import URIRef
from rdflib.Literal import Literal

DC_TITLE = URIRef('http://purl.org/dc/elements/1.1/title')
REVIEW_SUBJECT = URIRef('http://amk.ca/xml/review/1.0#subject')
book_uri = URIRef('urn:isbn:0609602330')

store.add((URIRef('http://www.amk.ca/books/h/Isaacs_Storm.html'), 
           REVIEW_SUBJECT,
           book_uri
           ))

You can also remove a triple:

store.remove((URIRef('http://www.amk.ca/books/h/Isaacs_Storm.html'), 
              REVIEW_SUBJECT,
              book_uri
              ))

Example code: Loading RDF data

You can add the contents of a URL, parsing the data as RDF/XML:

store.load('http://www.amk.ca/books/h/Bridal_Wreath.rdf')

When you load a file of data, a ContextStore object is returned. This can be used later to delete assertions that came from this file.

# Load a file
context = store.load('http://www.amk.ca/books/h/Bridal_Wreath.rdf')

# Remove the triples added from this file
store.remove_context(context.identifier)
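
The idea behind contexts can be sketched without rdflib: remember which source each triple came from, so that one source's assertions can be withdrawn later. ContextualStore and its methods are hypothetical names for this sketch and say nothing about how rdflib implements contexts; the two source URLs are also made up.

```python
class ContextualStore:
    """Toy triple store that remembers where each triple came from."""

    def __init__(self):
        self._by_context = {}   # context identifier -> set of triples

    def load(self, identifier, triples):
        """Record triples under a context identifier, e.g. a source URL."""
        self._by_context.setdefault(identifier, set()).update(triples)
        return identifier

    def remove_context(self, identifier):
        """Forget every triple that was loaded under identifier."""
        self._by_context.pop(identifier, None)

    def __len__(self):
        # A triple asserted by two sources is counted once.
        return len(set().union(*self._by_context.values()))

DC_TITLE = 'http://purl.org/dc/elements/1.1/title'

store = ContextualStore()
a = store.load('http://example.com/a.rdf',
               [('urn:isbn:0609602330', DC_TITLE, "Isaac's Storm")])
b = store.load('http://example.com/b.rdf',
               [('urn:isbn:1930110111', DC_TITLE, 'XSLT Quickly')])
count_before = len(store)

store.remove_context(a)
count_after = len(store)
print(count_before, count_after)
```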

Example code: Querying the database

The most general query method is triples(), which takes a (subject, property, object) 3-tuple, returning an iterator over the matching triples.

Each element of the tuple can contain a URIRef or Literal instance, or None to match anything.

For example, to list all things which have a dc:title property:

>>> DC_TITLE = URIRef('http://purl.org/dc/elements/1.1/title')
>>> for s,p,o in store.triples((None, DC_TITLE, None)): 
...     print s,p,o
...
urn:isbn:0609602330 http://purl.org/dc/elements/1.1/title \
   Isaac's Storm
urn:isbn:1930110111 http://purl.org/dc/elements/1.1/title \
   XSLT Quickly
>>>
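
The wildcard matching that triples() performs can be sketched in a few lines of plain Python. This is an illustration of the idea only, with strings standing in for URIRef and Literal instances; it is not rdflib's implementation.

```python
DC_TITLE = 'http://purl.org/dc/elements/1.1/title'
REVIEW_SUBJECT = 'http://amk.ca/xml/review/1.0#subject'

STORE = [
    ('urn:isbn:0609602330', DC_TITLE, "Isaac's Storm"),
    ('urn:isbn:1930110111', DC_TITLE, 'XSLT Quickly'),
    ('http://example.com/rev1', REVIEW_SUBJECT, 'urn:isbn:1930110111'),
]

def triples(store, pattern):
    """Yield each triple in store that agrees with pattern.

    A None in the pattern matches any value in that position.
    """
    for triple in store:
        if all(want is None or want == got
               for want, got in zip(pattern, triple)):
            yield triple

# Everything that has a dc:title property.
titles = [o for s, p, o in triples(STORE, (None, DC_TITLE, None))]
print(titles)
```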

Example code: Querying II

Once we've parsed a pile of RDF about reviews, we can find all reviews of a particular book:

REVIEW_SUBJECT = URIRef('http://amk.ca/xml/review/1.0#subject')
t = (None, REVIEW_SUBJECT, URIRef('urn:isbn:1930110111'))
for subj,pred,obj in store.triples(t):
    print subj

For a fancier query, you have to implement the logic in Python. To list all reviews of books by an author with a particular name:

# The dc:creator property URI
DC_CREATOR = URIRef('http://purl.org/dc/elements/1.1/creator')

# Loop over all the books whose dc:creator 
# property is the right value
for book, pred, obj in store.triples((None, 
                                      DC_CREATOR, 
                                      Literal("DuCharme, Bob"))):
    # Find every review whose subject is this book
    t = (None, REVIEW_SUBJECT, book)
    for review, pred, obj in store.triples(t):
        print 'Review URL, reviewed book URI:', review, book

Versa: An RDF query language

Versa (uche.ogbuji.net/tech/rdf/versa/) is a query language for searching RDF models.

The following query lists all reviews of books by an author with a particular name (probably not optimally):

((all() |- foaf:name -> eq("DuCharme, Bob", .))
   <- dc:creator - true)
       <- rev:subject - .

Larger example: Making an RSS file

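RSS here is RSS 1.0 (RDF Site Summary), the RDF-based syndication format; the acronym is also read as Really Simple Syndication.

The function below assumes that constants such as BASE_URL, RSS_URI, DC_URI, RDFNS, TYPE and SEQ are defined elsewhere in the module.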

def update_rss (entries):
    u = URIRef

    # Describe the channel
    ts = TripleStore()
    channel_uri = u(BASE_URL)
    ts.add(channel_uri, TYPE, u(RSS_URI+'channel'))
    ts.add(channel_uri, u(RSS_URI+'title'), 
           Literal(WEBLOG_TITLE))
    ts.add(channel_uri, u(RSS_URI+'link'), 
           Literal(BASE_URL))
    ts.add(channel_uri, u(RSS_URI+'description'), 
           Literal(WEBLOG_DESCRIPTION))

    # Create an RDF Sequence (doing it by hand -- ick!)
    seq = BNode()
    ts.add(seq, TYPE, SEQ)
    ts.add(channel_uri, u(RSS_URI + 'items'), seq)

RSS file (cont'd)

    # Add items to the Sequence    
    for index, (e_date, entry) in enumerate(entries):
        url = u(entry.permalink)
        title = entry.title
        e_date = e_date.date()

        ts.add(url, TYPE, u(RSS_URI+'item'))
        ts.add(url, u(DC_URI + 'title'), Literal(title))
        ts.add(url, u(DC_URI + 'date'),
               Literal(e_date.isoformat()))
               
        # Add item URL to sequence: the property names are 
        # ...#_1, ...#_2, ...#_3, ...
        ts.add(seq, u(RDFNS + '_' + str(index+1)), url)
        
    # Write the RSS file to a file
    ts.save("index.rss")
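
The numbering of sequence members can be shown on its own. In this sketch seq_triples() is a hypothetical helper and the node and item names are placeholders; only the RDF namespace URI and the _1, _2, ... naming scheme come from the slides.

```python
RDFNS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

def seq_triples(seq, members):
    """Return (seq, rdf:_N, member) triples, numbering members from 1."""
    return [(seq, RDFNS + '_' + str(n), member)
            for n, member in enumerate(members, start=1)]

# Placeholder sequence node and item URLs.
items = ['http://example.com/entry/a', 'http://example.com/entry/b']
for triple in seq_triples('_:seq', items):
    print(triple)
```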

RDF Schema

With RDF, we can refer to resources and list a bunch of their properties. But how do we know when a resource is a review?

We need a way to say "Resource X is of the class Review." This can be expressed as a triple:

(Resource, rdf:type, class-URI)

Therefore, RDF classes are described in RDF. (Gets a bit head-bending at times...)

RDF Schema: Classes

Here's an example in N3 that declares two classes.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rev: <http://amk.ca/xml/review/1.0#> .

# Declare a Review class

rev:Review  # class URI: http://amk.ca/xml/review/1.0#Review
   rdf:type rdfs:Class ;
   rdf:ID "Review" ; 
   rdfs:comment """Reviews are resources that express an opinion 
about some other resource.""" ;
.

# Declare a subclass of Review.
rev:ComparativeReview a rdfs:Class ;
   rdfs:subClassOf rev:Review ;
   rdfs:comment """Comparative reviews examine multiple resources,
comparing their relative merits and usually offering an opinion
about which one is the best.""" .

RDF Schema: Declaring an instance

To declare that a particular resource is a rev:Review, assert that the resource's rdf:type property is the class:

# Declare a resource    
<http://example.com/rev1> rdf:type rev:Review .
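
A schema-aware program can combine rdf:type assertions with rdfs:subClassOf declarations to work out every class a resource belongs to. The sketch below uses prefixed names as plain strings, a made-up resource http://example.com/rev2, and a hypothetical helper types_of(); it illustrates the subclass rule, not any particular library.

```python
RDF_TYPE = 'rdf:type'
SUBCLASS_OF = 'rdfs:subClassOf'

TRIPLES = [
    ('rev:ComparativeReview', SUBCLASS_OF, 'rev:Review'),
    ('http://example.com/rev1', RDF_TYPE, 'rev:Review'),
    # Hypothetical second review, declared with the more specific class.
    ('http://example.com/rev2', RDF_TYPE, 'rev:ComparativeReview'),
]

def types_of(resource, triples):
    """Return the declared classes of resource plus all their superclasses."""
    found = set()
    pending = [o for s, p, o in triples if s == resource and p == RDF_TYPE]
    while pending:
        cls = pending.pop()
        if cls in found:
            continue            # already handled; also guards against cycles
        found.add(cls)
        pending.extend(o for s, p, o in triples
                       if s == cls and p == SUBCLASS_OF)
    return found

print(sorted(types_of('http://example.com/rev2', TRIPLES)))
```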

rdf:Property

Once classes have been defined, you can also list their properties. The following fragment defines the rev:subject property:

rev:subject rdf:type rdf:Property;
    rdfs:label "Subject property" ;
    rdfs:domain rev:Review ;
    rdfs:range rdfs:Resource ;
    rdfs:comment "Value is the resource being reviewed." ;
    .
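
Domain and range declarations let software draw conclusions from ordinary statements: the subject of a property belongs to the property's domain, and the value belongs to its range. A minimal sketch, assuming prefixed names as plain strings and at most one domain and one range per property; infer_types() is a hypothetical helper.

```python
TRIPLES = [
    ('rev:subject', 'rdfs:domain', 'rev:Review'),
    ('rev:subject', 'rdfs:range', 'rdfs:Resource'),
    ('http://example.com/rev1', 'rev:subject', 'urn:isbn:1930110111'),
]

def infer_types(triples):
    """Return the rdf:type triples implied by rdfs:domain and rdfs:range."""
    # Simplification: one domain and one range per property.
    domains = {s: o for s, p, o in triples if p == 'rdfs:domain'}
    ranges = {s: o for s, p, o in triples if p == 'rdfs:range'}
    inferred = set()
    for s, p, o in triples:
        if p in domains:
            inferred.add((s, 'rdf:type', domains[p]))
        if p in ranges:
            inferred.add((o, 'rdf:type', ranges[p]))
    return inferred

for triple in sorted(infer_types(TRIPLES)):
    print(triple)
```

Here the review never had to be declared a rev:Review explicitly; using rev:subject on it is enough for the schema to imply it.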

Web ontology: Describing vocabularies

With RDF Schema, we know:

Which classes exist,
What their properties are.

We don't know:

When are two classes the same?
Can properties have multiple values?
- e.g. Can a book have more than one author?
- ... no authors?
- ... more than one publisher?
If X property Y and Y property Z are both true, which of the following are true?
- X property Z (transitivity)
- Y property X (symmetry)
- Z property X

Web ontology: DAML+OIL and OWL

OWL is a W3C language for defining this sort of relationship, currently still in the rough draft stage.

It's based on two existing research languages, the American DAML (DARPA Agent Markup Language) and the European OIL (Ontology Inference Layer).

OWL: Class declaration

Here's an OWL declaration of a class representing persons.

owl:Class is a subclass of the RDF Schema rdfs:Class, so software that's RDF Schema-aware but not OWL-aware can still work just fine.

@prefix gen: <http://genealogy.example.com/schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

gen:Person a owl:Class;
    rdf:ID "person" ;
    rdfs:comment "Resource representing a person." ;
    .

OWL: Property declaration

owl:TransitiveProperty is a subclass of owl:ObjectProperty, which is a subclass of rdf:Property.

gen:ancestor a owl:TransitiveProperty;
    rdfs:domain gen:Person;
    rdfs:range gen:Person;
    .

Web ontology: What's the point?

DAML+OIL and OWL add two more pieces of the final puzzle, one simple and one complicated:

The simple part: OWL adds the ability to indicate when two classes or properties are identical.

This lets bodies of data in different schemas be linked together.

The complex part: OWL declarations provide additional information to let rule-checking and theorem-proving systems work with RDF data.

e.g. the 'ancestor' property is transitive.
If (X ancestor Y) and (Y ancestor Z) are true, a system could infer that (X ancestor Z) is also true.
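
The inference for a transitive property can be sketched as a simple fixed-point loop over (subject, object) pairs. The names X, Y, Z, W are placeholders, and this brute-force approach only illustrates the rule; real inference engines are considerably more sophisticated.

```python
def transitive_closure(pairs):
    """Return every (x, z) implied by chaining (x, y) and (y, z) pairs."""
    closure = set(pairs)
    while True:
        new = {(x, z)
               for x, y in closure
               for y2, z in closure
               if y == y2} - closure
        if not new:
            return closure      # nothing more can be inferred
        closure |= new

# Asserted 'ancestor' statements, written as (subject, object) pairs.
asserted = {('X', 'Y'), ('Y', 'Z'), ('Z', 'W')}
inferred = transitive_closure(asserted)
print(sorted(inferred - asserted))
```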

What should you care about?

So how much of this stuff do you need to learn about and use?

RDF
- Yes, definitely!
- It's the base of the pyramid, and is widely implemented.
- You're free to hard-wire property URIs into your code.
RDF Schema
- Will likely be important in future.
- If you hard-wire URIs, your software will miss extensions to your data model.
- But... you can get useful work done without it.
OWL / DAML+OIL
- Unless this is your field of research, don't bother yet.
- DAML+OIL support is scattered; OWL is unfinished.
- For many applications OWL will always be irrelevant.

Why hasn't RDF caught on yet?

Major problems:

Harder to learn and to get started with than the Web was.
There's no obvious killer application.
Businesses have little incentive to publish RDF data.
Much W3C advocacy makes the Semantic Web sound too futuristic.

Why hasn't RDF caught on yet? (cont'd)

Minor problems:

The RDF Core spec is hard to read and really boring.
- An updated spec is in preparation, broken into several documents.
- One new document is an RDF Primer that's quite good: www.w3.org/TR/rdf-primer/
Introductory tutorials are few.
RDF software supports the basics, but higher-level interfaces are still being figured out.

Starting Small

But we don't need to aim for the stars. Simple things can be done without much effort, and can still be useful:

Do you have some data that can be published as RDF?
- In addition to publishing HTML pages, publish an RDF version.
Write Semantic Web-enabled applications:
- Can you link two different data sets to do something interesting?

On the other hand, there are signs of life: the DAML crawler (www.daml.org/crawler/) found 18,500 pages containing RDF in May 2002, and 706,821 pages as of today.

So the Semantic Web is coming... slowly.

Questions, comments?

These slides: www.amk.ca/talks/semweb-intro
