Introduction to the Semantic Web and RDF

PyCon
March 28, 2003

A.M. Kuchling
www.amk.ca
amk @ amk.ca

Overview of the Semantic Web

The Semantic Web has been a W3C project since around 1999.

The existing Web of HTML documents is good for humans:

The Semantic Web will augment the existing human-readable Web with structured data that's easy for software to process.

Layers of the Semantic Web

The current architecture for the Semantic Web is split into three layers:

From lowest to highest:

  1. Resource Description Framework (RDF): lets you assert facts
    e.g. person X is named "Drew".
  2. RDF Schema: lets you describe vocabularies and use them to describe things
    e.g. person X is a LivingPerson.
  3. Web Ontology Language (OWL): lets you describe relationships between vocabularies
    e.g. persons in schema A are the same thing as users in schema B.

Overview of RDF

The Resource Description Framework (RDF) is a generic format for metadata.

What's metadata? It's information about other data:

RDF is a specification that defines a model for representing the world, and a syntax for serializing and exchanging that model.

Motivating Example: Book Reviews

I have a bunch of book reviews, and I want people to be able to find them and use them in other applications.

Model:

The simple approach: a comma-separated file of ISBNs and URLs:

1-930110-11-1,http://example.com/rev1
0-471-21822-7,http://example.com/rev2

How do we incorporate title, author, and all that? Add more columns...

What if there are multiple authors?

Motivating Example: RFC-2822 Approach

Let's try a more self-describing approach:

ISBN: 1-930110-11-1
URL: http://example.com/rev1
Author: DuCharme, Bob
Author: Author2, A.
Title: XSLT Quickly
Publisher: Manning
Pages: 450pp

What if the author of a book has a web site?

ISBN: 1-930110-11-1
Author: DuCharme, Bob
Author-URL: http://www.snee.com/bob/
Author2: Author2, A.
Author2-URL: http://a2.example.com/
  ...

This gets the model slightly wrong: the web site is a property of the author, not the book.
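
The header-style record from the first attempt can be read with Python's standard email parser, which already copes with repeated fields. This is only a sketch of the idea (written for Python 3, standard library only); the record text is copied from the slide above.

```python
from email import message_from_string

# The self-describing record, with a repeated Author field
record = """\
ISBN: 1-930110-11-1
URL: http://example.com/rev1
Author: DuCharme, Bob
Author: Author2, A.
Title: XSLT Quickly
Publisher: Manning
Pages: 450pp
"""

msg = message_from_string(record)

# get_all() returns every value of a repeated header, in order
authors = msg.get_all('Author')
title = msg['Title']

print(authors)
print(title)
```

Repeated fields solve the multiple-author problem, but the record is still flat: nothing in it can attach a URL to one particular author.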

Example: RFC-2822: Second try

So let's track reviews and authors in separate files.

reviews.dat:

ISBN: 1-930110-11-1
Author-ID: 0042
...

authors.dat:

Author-ID: 0042
Name: DuCharme, Bob
URL: http://www.snee.com/bob/

Author-ID: 0043
Name: Undset, Sigrid

...

What if we wanted to store information about publishers, or author web sites?

An Example RDF Graph

In the RDF view of the world, everything is a graph.

[Example RDF graph]

RDF Graph: Resources

Resources are identified by URIs

[Example RDF graph highlighting resources]

RDF Graph: Literals

[Example RDF graph highlighting literals]

RDF Graph: Properties

[Example RDF graph highlighting property arcs]

RDF Graph: Property URIs

How are properties identified? They could be just names or serial numbers, but that wouldn't be very scalable.

Instead, properties have URIs just like resources.
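
Full property URIs are long, so they are usually written with a short prefix such as dc:title. A minimal sketch of how a prefixed name expands to a full URI, assuming a hand-written prefix table (the PREFIXES dictionary and expand() function are illustrative, not part of any library):

```python
# Illustrative prefix table for the vocabularies used in this talk
PREFIXES = {
    'rev': 'http://amk.ca/xml/review/1.0#',
    'dc': 'http://purl.org/dc/elements/1.1/',
    'FOAF': 'http://xmlns.com/foaf/0.1/',
}

def expand(qname):
    """Turn a prefixed name like 'dc:title' into a full property URI."""
    prefix, local = qname.split(':', 1)
    return PREFIXES[prefix] + local

title_uri = expand('dc:title')
print(title_uri)
```

Because the prefix is only shorthand, two people can pick different prefixes and still mean the same property, as long as the expanded URIs match.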

RDF statements and triples

Definition: RDF statements are always
(subject,property,object) 3-tuples.

Subject:  http://example.com/rev1
Property: rev:subject → http://amk.ca/xml/review/1.0#subject
Object:   urn:isbn:1930110111

Subject:  urn:isbn:1930110111
Property: dc:title → http://purl.org/dc/elements/1.1/title
Object:   "XSLT Quickly"

Subject:  urn:isbn:1930110111
Property: dc:creator → http://purl.org/dc/elements/1.1/creator
Object:   http://example.com/author/0042

Subject:  http://example.com/author/0042
Property: FOAF:surname → http://xmlns.com/foaf/0.1/surname
Object:   "DuCharme"

Subject:  http://example.com/author/0042
Property: FOAF:homepage → http://xmlns.com/foaf/0.1/homepage
Object:   http://www.snee.com/bob/

Subject:  http://example.com/author/0042
Property: FOAF:pastProject → http://xmlns.com/foaf/0.1/pastProject
Object:   urn:isbn:1930110111
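
Since every statement is a 3-tuple, a graph can be sketched in plain Python as a set of tuples. This is a simplification for illustration only: plain strings stand in for both resources and literals, which a real RDF library keeps distinct.

```python
REV = 'http://amk.ca/xml/review/1.0#'
DC = 'http://purl.org/dc/elements/1.1/'
FOAF = 'http://xmlns.com/foaf/0.1/'

book = 'urn:isbn:1930110111'
author = 'http://example.com/author/0042'

# The six statements from the example graph, as (subject, property, object)
triples = {
    ('http://example.com/rev1', REV + 'subject', book),
    (book, DC + 'title', 'XSLT Quickly'),
    (book, DC + 'creator', author),
    (author, FOAF + 'surname', 'DuCharme'),
    (author, FOAF + 'homepage', 'http://www.snee.com/bob/'),
    (author, FOAF + 'pastProject', book),
}

# Every arc leaving the author node
author_props = sorted(p for s, p, o in triples if s == author)
for prop in author_props:
    print(prop)
```

A set fits the model well: asserting the same statement twice adds nothing, and the statements have no inherent order.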

RDF syntaxes: RDF/XML

RDF Core defines an XML-based serialization for RDF.

<rdf:RDF 
    xmlns:FOAF="http://xmlns.com/foaf/0.1/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rev="http://amk.ca/xml/review/1.0#">

    <!-- Implies rdf:type property is rev:Review -->
    <rev:Review rdf:about="http://example.com/rev1">
        <rev:subject rdf:resource="urn:isbn:1930110111"/>
    </rev:Review>

    <rdf:Description rdf:about="http://example.com/author/0042">
        <FOAF:firstName>Bob</FOAF:firstName>
        <FOAF:homepage rdf:resource="http://www.snee.com/bob/"/>
        <FOAF:pastProject rdf:resource="urn:isbn:1930110111"/>
        <FOAF:surname>DuCharme</FOAF:surname>
    </rdf:Description>
</rdf:RDF>
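
To see how this XML maps back onto triples, here is a rough sketch using only the standard library's ElementTree. It handles just the constructs on this slide (rdf:about, rdf:resource, literal text, and typed nodes) and is nowhere near a complete RDF/XML parser; the helper names are made up for the example.

```python
import xml.etree.ElementTree as ET

RDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

DOC = """<rdf:RDF
    xmlns:FOAF="http://xmlns.com/foaf/0.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rev="http://amk.ca/xml/review/1.0#">
    <rev:Review rdf:about="http://example.com/rev1">
        <rev:subject rdf:resource="urn:isbn:1930110111"/>
    </rev:Review>
    <rdf:Description rdf:about="http://example.com/author/0042">
        <FOAF:firstName>Bob</FOAF:firstName>
        <FOAF:homepage rdf:resource="http://www.snee.com/bob/"/>
        <FOAF:pastProject rdf:resource="urn:isbn:1930110111"/>
        <FOAF:surname>DuCharme</FOAF:surname>
    </rdf:Description>
</rdf:RDF>"""

def uri(tag):
    """ElementTree spells qualified names as '{namespace}local'."""
    namespace, local = tag[1:].split('}', 1)
    return namespace + local

def simple_triples(xml_text):
    """Yield (subject, property, object) for a tiny subset of RDF/XML."""
    root = ET.fromstring(xml_text)
    for node in root:
        subject = node.get('{%s}about' % RDF)
        # A typed node such as rev:Review implies an rdf:type statement
        if node.tag != '{%s}Description' % RDF:
            yield (subject, RDF + 'type', uri(node.tag))
        for prop in node:
            resource = prop.get('{%s}resource' % RDF)
            obj = resource if resource is not None else prop.text
            yield (subject, uri(prop.tag), obj)

parsed = list(simple_triples(DOC))
for triple in parsed:
    print(triple)
```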

RDF syntaxes: Notation-3 (or N3)

An informal syntax that's easier to read and easier to scribble.

@prefix rev: <http://amk.ca/xml/review/1.0#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix FOAF: <http://xmlns.com/foaf/0.1/> .

<http://example.com/rev1> rev:subject [
   = <urn:isbn:1930110111>;
   dc:title "XSLT Quickly";
   dc:creator <http://example.com/author/0042>;
   dc:publisher "Manning" ] .

<http://example.com/author/0042> 
    FOAF:firstName "Bob";
    FOAF:surname "DuCharme";
    FOAF:homepage <http://www.snee.com/bob/>;
    FOAF:pastProject <urn:isbn:1930110111> .

RDF's Sins and Virtues

Virtues:

Sins:

Available RDF Software

The most basic form of RDF software is simply an RDF parser. Parsers are available for most of the languages you might need:

Example code: Initializing an RDF database

Here's a Python example using rdflib 1.2 (www.rdflib.net).

First, create an InformationStore, which is a database of triples. The store is backed by a BerkeleyDB database, so it's persistent.

from rdflib.InformationStore import InformationStore

# Create an InformationStore to hold the RDF data
store = InformationStore()
store.open('/tmp/temporary-store')

Example code: Modifying the database

You can add triples to InformationStores:

from rdflib.URIRef import URIRef
from rdflib.Literal import Literal

DC_TITLE = URIRef('http://purl.org/dc/elements/1.1/title')
REVIEW_SUBJECT = URIRef('http://amk.ca/xml/review/1.0#subject')
book_uri = URIRef('urn:isbn:0609602330')

store.add((URIRef('http://www.amk.ca/books/h/Isaacs_Storm.html'), 
           REVIEW_SUBJECT,
           book_uri
           ))

You can also remove a triple:

store.remove((URIRef('http://www.amk.ca/books/h/Isaacs_Storm.html'), 
              REVIEW_SUBJECT,
              book_uri
              ))

Example code: Loading RDF data

You can add the contents of a URL, parsing the data as RDF/XML:

store.load('http://www.amk.ca/books/h/Bridal_Wreath.rdf')

When you load a file of data, a ContextStore object is returned. This can be used later to delete assertions that came from this file.

# Load a file
context = store.load('http://www.amk.ca/books/h/Bridal_Wreath.rdf')

# Remove the triples added from this file
store.remove_context(context.identifier)

Example code: Querying the database

The most general query method is triples(), which takes a (subject, property, object) 3-tuple in which None acts as a wildcard, and returns an iterator over the matching triples.

For example, to list all things which have a dc:title property:

>>> DC_TITLE = URIRef('http://purl.org/dc/elements/1.1/title')
>>> for s,p,o in store.triples((None, DC_TITLE, None)): 
...     print s,p,o
...
urn:isbn:0609602330 http://purl.org/dc/elements/1.1/title \
   Isaac's Storm
urn:isbn:1930110111 http://purl.org/dc/elements/1.1/title \
   XSLT Quickly
>>>
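
The wildcard matching behind this kind of query is easy to picture. The following is not rdflib's implementation, just a minimal sketch over a plain list of tuples in which None in the pattern matches anything; match() and TRIPLES are names invented for the example.

```python
DC_TITLE = 'http://purl.org/dc/elements/1.1/title'
REVIEW_SUBJECT = 'http://amk.ca/xml/review/1.0#subject'

TRIPLES = [
    ('urn:isbn:0609602330', DC_TITLE, "Isaac's Storm"),
    ('urn:isbn:1930110111', DC_TITLE, 'XSLT Quickly'),
    ('http://example.com/rev1', REVIEW_SUBJECT, 'urn:isbn:1930110111'),
]

def match(triples, pattern):
    """Yield each triple that agrees with the pattern; None is a wildcard."""
    for triple in triples:
        if all(want is None or want == got
               for want, got in zip(pattern, triple)):
            yield triple

titled = list(match(TRIPLES, (None, DC_TITLE, None)))
for s, p, o in titled:
    print(s, o)
```

A linear scan like this is fine for a few hundred triples; a real store would keep indexes on subject, property, and object.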

Example code: Querying II

Once we've parsed a pile of RDF about reviews, we can find all reviews of a particular book:

REVIEW_SUBJECT = URIRef('http://amk.ca/xml/review/1.0#subject')
t = (None, REVIEW_SUBJECT, URIRef('urn:isbn:1930110111'))
for subj,pred,obj in store.triples(t):
    print subj

For a fancier query, you have to implement the logic in Python. To list all reviews of books by an author with a particular name:

DC_CREATOR = URIRef('http://purl.org/dc/elements/1.1/creator')

# Loop over all the books whose dc:creator 
# property is the right value
for book, pred, obj in store.triples((None, 
                                      DC_CREATOR, 
                                      Literal("DuCharme, Bob"))):
    ##print 'Book URI:', book
    t = (None, REVIEW_SUBJECT, book)
    for review, pred, obj in store.triples(t):
        print 'Review URL, reviewed book URI:', review, book
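
In the example graph shown earlier, dc:creator points at an author resource rather than a name string, so the same question needs one more hop: name to author, author to book, book to review. A plain-Python sketch of that three-step join, using tuples instead of an rdflib store (the subjects() helper is invented for the example):

```python
FOAF_SURNAME = 'http://xmlns.com/foaf/0.1/surname'
DC_CREATOR = 'http://purl.org/dc/elements/1.1/creator'
REVIEW_SUBJECT = 'http://amk.ca/xml/review/1.0#subject'

triples = [
    ('http://example.com/rev1', REVIEW_SUBJECT, 'urn:isbn:1930110111'),
    ('urn:isbn:1930110111', DC_CREATOR, 'http://example.com/author/0042'),
    ('http://example.com/author/0042', FOAF_SURNAME, 'DuCharme'),
]

def subjects(triples, prop, obj):
    """Return every subject that has the given property and object."""
    return [s for s, p, o in triples if p == prop and o == obj]

# Hop 1: authors with the surname; hop 2: their books; hop 3: the reviews
authors = subjects(triples, FOAF_SURNAME, 'DuCharme')
books = [b for a in authors for b in subjects(triples, DC_CREATOR, a)]
reviews = [r for b in books for r in subjects(triples, REVIEW_SUBJECT, b)]

print(reviews)
```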

Versa: An RDF query language

Versa (uche.ogbuji.net/tech/rdf/versa/) is a query language for searching RDF models.

The following query lists all reviews of books by an author with a particular name (probably not optimally):

((all() |- foaf:name -> eq("DuCharme, Bob", .))
   <- dc:creator - true)
       <- rev:subject - .

Larger example: Making an RSS file

RSS 1.0 = RDF Site Summary, a syndication format built on RDF.

def update_rss (entries):
    u = URIRef

    # Describe the channel
    ts = TripleStore()
    channel_uri = u(BASE_URL)
    ts.add(channel_uri, TYPE, u(RSS_URI+'channel'))
    ts.add(channel_uri, u(RSS_URI+'title'), 
           Literal(WEBLOG_TITLE))
    ts.add(channel_uri, u(RSS_URI+'link'), 
           Literal(BASE_URL))
    ts.add(channel_uri, u(RSS_URI+'description'), 
           Literal(WEBLOG_DESCRIPTION))

    # Create an RDF Sequence (doing it by hand -- ick!)
    seq = BNode()
    ts.add(seq, TYPE, SEQ)
    ts.add(channel_uri, u(RSS_URI + 'items'), seq)

RSS file (cont'd)

    # Add items to the Sequence    
    for index, (e_date, entry) in enumerate(entries):
        url = u(entry.permalink)
        title = entry.title
        e_date = e_date.date()

        ts.add(url, TYPE, u(RSS_URI+'item'))
        ts.add(url, u(DC_URI + 'title'), Literal(title))
        ts.add(url, u(DC_URI + 'date'),
               Literal(e_date.isoformat()))
               
        # Add item URL to sequence: the property names are 
        # ...#_1, ...#_2, ...#_3, ...
        ts.add(seq, u(RDFNS + '_' + str(index+1)), url)
        
    # Write the RSS data to a file
    ts.save("index.rss")
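
The by-hand sequence is the least obvious part: membership in an RDF Sequence is expressed with numbered properties rdf:_1, rdf:_2, and so on. A standalone sketch of just that numbering, with plain tuples and a made-up seq_triples() helper (the '_:seq' label and item URLs are placeholders):

```python
RDFNS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

def seq_triples(seq, members):
    """Link each member to the sequence node via rdf:_1, rdf:_2, ..."""
    return [(seq, RDFNS + '_' + str(index + 1), member)
            for index, member in enumerate(members)]

items = ['http://example.com/entry/a', 'http://example.com/entry/b']
membership = seq_triples('_:seq', items)

for triple in membership:
    print(triple)
```

The numbering starts at 1, which is why the talk's code adds 1 to the index from enumerate().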

RDF Schema

With RDF, we can refer to resources and list a bunch of their properties. But how do we know when a resource is a review?

We need a way to say "Resource X is of the class Review." This can be expressed as a triple:

(Resource, rdf:type, class-URI)

Therefore, RDF classes are described in RDF. (Gets a bit head-bending at times...)

RDF Schema: Classes

Here's an example in N3 that declares two classes.

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Declare a Review class

rev:Review  # class URI: http://amk.ca/xml/review/1.0#Review
   rdf:type rdfs:Class ;
   rdf:ID "Review" ; 
   rdfs:comment """Reviews are resources that express an opinion 
about some other resource.""" ;
.

# Declare a subclass of Review.
rev:ComparativeReview a rdfs:Class ;
   rdfs:subClassOf rev:Review ;
   rdfs:comment """Comparative reviews examine multiple resources,
comparing their relative merits and usually offering an opinion
about which one is the best.""" ;
.

RDF Schema: Declaring an instance

To declare that a particular resource is a rev:Review, assert that the resource's rdf:type property is the class:

# Declare a resource    
<http://example.com/rev1> rdf:type rev:Review .
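
Class membership gets more interesting once subclasses are involved: an instance of rev:ComparativeReview is also a rev:Review. A small sketch of that check over plain tuples; typing http://example.com/rev2 as a comparative review is an assumption made up for the example, as are the helper names.

```python
RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'
SUBCLASS = 'http://www.w3.org/2000/01/rdf-schema#subClassOf'
REVIEW = 'http://amk.ca/xml/review/1.0#Review'
COMPARATIVE = 'http://amk.ca/xml/review/1.0#ComparativeReview'

triples = [
    (COMPARATIVE, SUBCLASS, REVIEW),
    ('http://example.com/rev1', RDF_TYPE, REVIEW),
    ('http://example.com/rev2', RDF_TYPE, COMPARATIVE),  # assumed for illustration
]

def superclasses(cls, triples):
    """Return cls plus every class reachable through rdfs:subClassOf."""
    seen = set()
    todo = [cls]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(o for s, p, o in triples
                        if s == current and p == SUBCLASS)
    return seen

def is_instance(resource, cls, triples):
    """True if any declared type of the resource is cls or a subclass of it."""
    return any(cls in superclasses(o, triples)
               for s, p, o in triples
               if s == resource and p == RDF_TYPE)

print(is_instance('http://example.com/rev2', REVIEW, triples))
```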

rdf:Property

Once classes have been defined, you can also list their properties. The following fragment defines the rev:subject property:

rev:subject rdf:type rdf:Property;
    rdfs:label "Subject property" ;
    rdfs:domain rev:Review ;
    rdfs:range rdfs:Resource ;
    rdfs:comment "Value is the resource being reviewed." ;
    .
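
One practical use of rdfs:domain: if rev:subject has domain rev:Review, then anything that has a rev:subject property can be inferred to be a rev:Review, even if nobody said so. A minimal sketch of that single inference rule over plain tuples (infer_types() is an invented name, and real schema-aware software applies many more rules):

```python
RDF_TYPE = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type'
DOMAIN = 'http://www.w3.org/2000/01/rdf-schema#domain'
REV_SUBJECT = 'http://amk.ca/xml/review/1.0#subject'
REV_REVIEW = 'http://amk.ca/xml/review/1.0#Review'

triples = [
    (REV_SUBJECT, DOMAIN, REV_REVIEW),
    ('http://example.com/rev1', REV_SUBJECT, 'urn:isbn:1930110111'),
]

def infer_types(triples):
    """Derive rdf:type statements from rdfs:domain declarations."""
    domains = {s: o for s, p, o in triples if p == DOMAIN}
    return {(s, RDF_TYPE, domains[p])
            for s, p, o in triples if p in domains}

inferred = infer_types(triples)
print(inferred)
```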

Web ontology: Describing vocabularies

With RDF Schema, we know:

We don't know:

Web ontology: DAML+OIL and OWL

OWL is a W3C language for defining this sort of relationship, currently still in the rough draft stage.

It's based on two existing research languages, the American DAML (DARPA Agent Markup Language) and the European OIL (Ontology Inference Layer).

OWL: Class declaration

Here's an OWL declaration of a class representing persons.

owl:Class is a subclass of the RDF Schema rdfs:Class, so software that's RDF Schema-aware but not OWL-aware can still work just fine.

@prefix gen: <http://genealogy.example.com/schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

gen:Person a owl:Class;
    rdf:ID "person" ;
    rdfs:comment "Resource representing a person." ;
    .

OWL: Property declaration

owl:TransitiveProperty is a subclass of owl:ObjectProperty, which is a subclass of rdf:Property.

gen:ancestor a owl:TransitiveProperty;
    rdfs:domain gen:Person;
    rdfs:range gen:Person;
    .
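
Declaring gen:ancestor transitive means software may chain it: if A has ancestor B and B has ancestor C, then A has ancestor C. A sketch of what an OWL-aware tool could derive, using plain tuples; the people URIs are hypothetical and ancestors() is an invented helper.

```python
ANCESTOR = 'http://genealogy.example.com/schema#ancestor'
PEOPLE = 'http://genealogy.example.com/people/'  # hypothetical URIs

triples = [
    (PEOPLE + 'child', ANCESTOR, PEOPLE + 'parent'),
    (PEOPLE + 'parent', ANCESTOR, PEOPLE + 'grandparent'),
]

def ancestors(person, triples):
    """Follow gen:ancestor arcs repeatedly, collecting everyone reached."""
    found = set()
    todo = [person]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if s == current and p == ANCESTOR and o not in found:
                found.add(o)
                todo.append(o)
    return found

print(sorted(ancestors(PEOPLE + 'child', triples)))
```

Only the two direct statements were asserted; the child-to-grandparent link comes from the transitivity declaration.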

Web ontology: What's the point?

DAML+OIL and OWL add two more pieces of the final puzzle, one simple and one complicated:

The simple part: OWL adds the ability to indicate when two classes or properties are identical.

The complex part: OWL declarations provide additional information to let rule-checking and theorem-proving systems work with RDF data.
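
The simple part can be sketched in a few lines. Assuming the equivalence is stated with owl:equivalentClass, and using two hypothetical schemas A and B, a program can treat persons in one schema and users in the other as the same class:

```python
EQUIV = 'http://www.w3.org/2002/07/owl#equivalentClass'
A_PERSON = 'http://a.example.com/schema#Person'  # hypothetical schema A
B_USER = 'http://b.example.com/schema#User'      # hypothetical schema B

triples = [
    (A_PERSON, EQUIV, B_USER),
]

def same_class(a, b, triples):
    """True if a and b are connected by equivalence statements.

    Equivalence is followed in both directions and across chains.
    """
    seen = {a}
    todo = [a]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if p != EQUIV:
                continue
            for here, there in ((s, o), (o, s)):
                if here == current and there not in seen:
                    seen.add(there)
                    todo.append(there)
    return b in seen

print(same_class(B_USER, A_PERSON, triples))
```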

What should you care about?

So how much of this stuff do you need to learn about and use?

Why hasn't RDF caught on yet?

Major problems:

Why hasn't RDF caught on yet? (cont'd)

Minor problems:

Starting Small

But we don't need to aim for the stars. Simple things can be done without much effort, and can still be useful:

On the other hand, there are signs of life: the DAML crawler (www.daml.org/crawler/) found 18,500 pages containing RDF in May 2002, and 706,821 pages as of today.

Questions, comments?

These slides: www.amk.ca/talks/semweb-intro