The MEMS Exchange Architecture

A.M. Kuchling

MEMS Exchange
CNRI
1895 Preston White Drive, Suite 100
Reston, VA
20191-5434




Abstract:

The MEMS Exchange project is an experiment in distributed microfabrication that needs to take advantage of the Internet in order to function. In developing the required custom software, the MEMS Exchange software development team faced a complex task with no clear way to figure out our requirements ahead of time. The team therefore selected or wrote tools that are simple yet still powerful and flexible, and also adopted a flexible and free-flowing methodology using some of the principles of agile programming. This paper will discuss the tools we've built and the techniques we use.

1 Introduction to the MEMS Exchange

The MEMS Exchange is a research project to create a distributed fabrication network for MEMS devices. MEMS stands for Micro-Electro-Mechanical Systems, and is concerned with building microscopic devices to perform some task. MEMS are constructed in a fashion similar to how integrated circuits are built, using many of the same techniques: depositing material onto silicon wafers, etching away material, implanting ions to change the silicon's electrical characteristics, imprinting patterns on wafers, and so forth. A critical difference between integrated circuits and MEMS is that ICs are purely electrical devices with no moving parts, while MEMS often contain moving parts such as sensors, actuators, gears, or mirrors. For example, the accelerometers used to trigger car airbags are MEMS devices, as are various microflow controllers used in medical applications.

Making integrated circuits is simpler than constructing a MEMS device, because ICs are usually made with one of a small number of standard process sequences. For example, the sequence of operations in the CMOS process sequence is pretty much identical across different manufacturers. The MEMS field, however, is much less standardized, and the process sequence will vary significantly depending on the device being constructed. Also, MEMS process sequences can use exotic processes that aren't relevant to IC manufacturing, such as deep reactive ion etching, wafer bonding, and LIGA. (Don't worry about what those terms mean; this is a paper about software, not MEMS technology.) Because MEMS process sequences may use exotic processes, it's impossible for a single chip factory (or "fab") to have every single process available. This specialization can make it hard to develop MEMS, because you have to find a fab that can carry out all the steps in your proposed process sequence; if you have several out-of-the-ordinary steps, there may not be a single fab that has them all.

The MEMS Exchange acts as a front-end for many fabs, so that different process steps in a process sequence can be performed at different fabs. For example, the first step can be done at fab A, the second step at fab B, the third step back at fab A, and so forth, simply by carefully packing the partially manufactured wafers and shipping them via courier from one fab to the next. The manufacturing process needs to be carefully tracked, so as steps are performed, the fab marks each step as completed and enters metrology results, which serve as a check on the accuracy of the processing. For example, if your process sequence calls for depositing 1 micron of silicon nitride on a wafer, the technician will measure the actual result after performing the step, which might turn out to be 0.98 microns. If you later find your devices don't work, metrology information can be helpful in trying to debug the process sequence.

The complexity of the task lies mostly in keeping track of all the jobs in progress. Our goal is to do as much as possible over the Internet; users can sign up on the Web and submit process sequences to us using a Web-based tool. We then sanity-check the process sequence, send it for approval by the fabrication sites, and then coordinate the performance of each step in the sequence. Users can also inspect their wafers using a microscope remote-controlled over the Internet, and fab staff can enter their metrology results. All of this requires software.

2 Architecture Overview

The overall architecture follows the conventional N-tier architecture. Objects within our application domain, such as process runs and metrology results, are represented by objects that live within a database. User interfaces and presentations are layered on top of the core objects. Currently we only maintain a Web-based interface layer; it would certainly be possible to use a GUI toolkit such as Qt or Tk, but so far there's been no compelling reason to do that.

The architecture becomes less conventional in its choice of implementation technologies. Python is our primary programming language, and the database is the Z Object Database, so our database is a collection of persistent Python objects instead of a set of relational tables. We've written our own Web application server, called Quixote, and we've also written a type checker and unit testing framework. Each of these components will be discussed in the following sections.

This system is much clearer and easier to understand than any of our previous efforts, and the code is more maintainable. This gives me some confidence that we can readily modify the system in future to meet new requirements.

3 The Medium: Using the Web

The highest-level question was whether to write a specialized application for using the MEMS Exchange, or to use the Web. We initially chose the wrong answer to this question when we began building an important component of the MEMS Exchange system called the Run Builder.

The Run Builder is an interface for assembling a process sequence by browsing through our catalog of available process capabilities, selecting a particular capability, and entering any required parameters. For example, the user could select a ``Silicon nitride PECVD'' process to deposit silicon nitride onto their wafer and then enter the desired thickness of the nitride layer. This value is usually limited to a particular range of values; for example, at fab X the legal thickness might be between 0.1 and 0.8 micrometres. Once the user is happy with their process sequence, they can save it and submit it to staff at the MEMS Exchange for review. If the sequence looks reasonable, then our staff will approve it and send the run on to the fabs for manufacturing.

We wrote the first Run Builder application using Jython and the Swing GUI classes for Java. Greg Ward wrestled InstallShield for quite a while in order to package the application and reduced installation to two downloads, one for the Java runtime and one for our application code. Speed was a problem, as the Run Builder was annoyingly slow to start up, taking about 30 seconds even using the latest 1.2 JDK available at the time, but we believed the UI flexibility possible with a full-blown application would let us compensate for this irritation. At this point we believed that the application would grow into a complete environment for using the MEMS Exchange. Someday, we dreamed, we'd build additional components for checking the status of a customer's runs, their payment status, editing mask layouts, viewing simulations, or anything else we came up with. In this view of the world, customers would download the MEMS Exchange application and then spend a sizable chunk of their day inside it.

This vision ran into one small snag: no one ever used the Run Builder. We badgered a few staff members at our affiliated fabs into downloading it and trying it out in order to do some usability testing, and even offered a 10% discount on runs submitted using the Builder, but customers just ignored it and continued sending in their process sequences through e-mail. It was easier for them to type everything out in their familiar e-mail software than to download an application and get it running.

At this point we decided to try making a Web-based Run Builder, and chose to build it using the Zope application server. We completed a working implementation in around 3 months. And, magically, this time people actually used it. The Zope-based version went live on December 18, 1999, and the first customer run was submitted through it 3 days later. The Zope implementation was used for several months, and over the next year roughly 400 runs were submitted using it.

The important lesson from this experience: Web-based interfaces, for all their many flaws, are preferable to a fancier interface that requires downloading additional software. People prefer to stay in their browser, which they use for several hours a day and know well, rather than try to figure out some new application.

4 The Application Server: Quixote

The earliest version of the site was written using Java servlets, and the Run Builder used Zope, but we were dissatisfied with both these options. Developing Java servlets was tediously slow, as it called for a repeated cycle of editing Java source code and recompiling it using the sluggish javac compiler. Testing the change often required restarting the Java process to pick up the recompiled class files.

We tried Zope next, and found it much easier to produce simple forms and reports using DTML, Python code, and SQL Methods than with Java and JDBC. However, we weren't that happy with Zope for building a more complex interface such as that of the Run Builder. There were a few problems that we ran into. The assembly of ExternalMethods and DTML Methods that made up the Run Builder was hard to understand and to debug, as we discovered when trying to fix bugs reported by customers. We missed CVS a great deal, because when you were trying to add a feature and inadvertently broke something, there was no way to pin down the problem by seeing what you'd changed since the last working version. Editing DTML code in Netscape's TEXTAREA fields was also quite cumbersome.

After looking at some other systems, we decided none of them really fit our picture of the ideal system, and therefore we set out to design and implement a small and simple framework for Web application development. The framework was given the name Quixote; this name was chosen as a joke, because we thought developing a new Web application framework in the year 2000 was definitely tilting at a windmill.

The design goals of Quixote were laid out at the beginning and are still used as guidelines for design enhancements. The goals were, and are:

  1. To make the environment, and even the HTML templating language, as similar to Python as possible. The idea is to apply as many of the skills and structural techniques used for writing regular Python code to the difficult task of building Web applications.

  2. To focus on the development of Web applications where the accent is more on complicated programming logic than complicated templating. Our graphic design is mostly done by programmers, so it's quite simple. The complexity of the system lies mostly in the underlying model, not the interface. This is now changing as we try to build more elaborate interfaces, but Quixote has proved able to cope with them.

  3. To avoid magical behaviour. When it's not obvious what to do in a certain case, Quixote stubbornly refuses to guess. Instead, it either does the same thing in all cases or raises an error, forcing the caller to be explicit. This design principle came from Python itself.

Another initial goal was to make Quixote simple enough and small enough to be implementable in a week or two. In fact the first version was implemented that quickly, though subsequently it underwent a good deal more rewriting as we needed new features.

Comparing Zope and Quixote on one particular point will illustrate the principle of avoiding magical behaviour. In Zope a method is considered public and callable by external clients if it has a docstring; methods without a docstring are considered to be private. This isn't obvious by looking at the code from some random Zope product -- you have to read Zope's documentation, and programmers just hate doing that -- and it prevents you from documenting your private methods with a docstring. The relationship between public/private status and possession of a docstring isn't obvious, and it is, in our jargon, ``magical''. Quixote requires you to be explicit; a _q_exports variable lists all the methods that are public. Methods not listed in _q_exports are private. This rule is easy to explain, easy to implement, and easy to use in a Quixote-based application. It's a bit verbose, as our modules need to have _q_exports declarations in them, but that's the price of clarity.

Quixote puts all the code and HTML for a Web-based application into a Python package containing .py and .ptl files, and provides a simple framework for publishing code and objects on the Web. For example, the MEMS Exchange interface lives in a package named mems.ui. Once our Web server has been configured to direct all requests to Quixote, accessing http://fab/ will run the function mems.ui._q_index. Accessing http://fab/run/ runs mems.ui.run._q_index, while http://fab/run/1/detailed retrieves an object representing run #1, and calls its detailed() method.
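The explicit-export rule and the URL mapping described above can be illustrated with a toy publisher. This is only a sketch under stated assumptions, not Quixote's implementation: the publish() function, the dynamic_lookup hook used to turn the path component "1" into a run object, and the toy Run class are all hypothetical names invented for the example. Only _q_exports and _q_index come from the text.

```python
from types import SimpleNamespace


class Run:
    """Toy stand-in for a run object; only 'detailed' is public."""
    _q_exports = ["detailed"]

    def __init__(self, run_id):
        self.run_id = run_id

    def detailed(self):
        return "details of run #%d" % self.run_id

    def secret(self):
        # Not listed in _q_exports, so it must never be reachable by URL.
        return "internal data"


# Namespaces standing in for the modules mems.ui.run and mems.ui.
# 'dynamic_lookup' is a hypothetical hook for non-static path components.
run = SimpleNamespace(
    _q_exports=[],
    _q_index=lambda: "list of runs",
    dynamic_lookup=lambda component: Run(int(component)),
)
ui = SimpleNamespace(
    _q_exports=["run"],
    _q_index=lambda: "front page",
    run=run,
    helper=lambda: "private helper",   # present, but not exported
)


def publish(root, path):
    """Resolve a URL path against 'root' and return the page text."""
    obj = root
    for component in [c for c in path.split("/") if c]:
        if component in getattr(obj, "_q_exports", []):
            obj = getattr(obj, component)
        elif hasattr(obj, "dynamic_lookup"):
            obj = obj.dynamic_lookup(component)
        else:
            # No guessing: anything not explicitly exported is an error.
            raise LookupError("%r is not exported" % component)
    if callable(obj):
        return obj()
    index = getattr(obj, "_q_index", None)
    if index is None:
        raise LookupError("no index page for %r" % path)
    return index()


print(publish(ui, "/"))
print(publish(ui, "/run/"))
print(publish(ui, "/run/1/detailed"))
```

In this sketch a request for /helper or /run/1/secret fails with a LookupError even though both callables exist, because neither name appears in a _q_exports list.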

PTL, Python Template Language, is used to mix HTML with Python code. An import hook is defined so that PTL files can be imported just like Python modules. The basic syntax of PTL is the same as Python's, with a few small changes, as seen in this example:

template barebones_header(title=None,
                          description=None):
    """
    <html><head>
    <title>%s</title>
    """ % html_quote(str(title))
    if description:
        '<meta name="description" content="%s">' % html_quote(description) 

    '</head><body bgcolor="#ffffff">'

Templates in PTL are analogous to Python's functions. Templates are defined using the template keyword, and they can't have docstrings for reasons explained in the next paragraph, but otherwise they follow Python's syntactic rules: indentation indicates scoping; single-quoted and triple-quoted strings can be used interchangeably; the same rules for line continuation apply; and so on. PTL also tries to follow the semantics of normal Python code as far as possible, so templates can have parameters, and these parameters can have default values, be treated as keyword arguments, and so forth.

The difference between a template and a regular Python function is that inside a template the results of expressions are accumulated, and the return value of the template is simply all those results concatenated together. Normally when Python evaluates an expression inside a function, its value is simply discarded, but in PTL the value is converted to a string using str() and appended to the output for the template. The initial string in a template is therefore not treated as a docstring but is simply incorporated into the generated output, which is why templates can't have docstrings. (We tried to figure out a way to allow docstrings, but guessing when a string was a docstring and when it was HTML text went against the ``no magical behaviour'' rule. Assuming the first string literal was always a docstring would mean that sometimes we'd have to write content-free docstrings just to fit that convention. In the end we decided to live without docstrings in PTL code.)

Inside a PTL template, you can use all of Python's control-flow statements:

template numbers(n):
    for i in range(n):
        i
        " " # PTL does not add any whitespace

numbers(5) will produce the output "0 1 2 3 4 ", because range(5) starts at 0. You can also have if or try...except statements inside a template, just as in Python code.
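The accumulation semantics can be emulated by hand in ordinary Python, which may help clarify what a template returns. The sketch below is not how PTL is implemented; it simply writes out, as explicit appends, what each expression statement in the two templates shown earlier contributes to the output. The standard library's html.escape is used as a stand-in for html_quote, which is an assumption about that function's behaviour.

```python
from html import escape  # stand-in for html_quote (an assumption)


def numbers(n):
    """Hand-written equivalent of the 'numbers' template."""
    output = []
    for i in range(n):
        output.append(str(i))   # the bare expression 'i'
        output.append(" ")      # the bare string literal
    return "".join(output)


def barebones_header(title=None, description=None):
    """Hand-written equivalent of the 'barebones_header' template."""
    output = []
    # The first string is ordinary output here, not a docstring.
    output.append("""
    <html><head>
    <title>%s</title>
    """ % escape(str(title)))
    if description:
        output.append('<meta name="description" content="%s">'
                      % escape(description))
    output.append('</head><body bgcolor="#ffffff">')
    return "".join(output)


print(repr(numbers(5)))   # prints '0 1 2 3 4 '
```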

This syntax is quite different from most HTML templating languages, which usually look like HTML with magical delimiters (e.g. <%...%>, <?...?>, <!--#... -->) sprinkled throughout. After we'd been using the Python-inspired PTL syntax for a while and found we quite liked it, Neil Schemenauer came up with the following explanation for why it works so well. Most templating languages assume the content is mostly HTML with just a few small bits of program code here and there. The presence of program code is therefore signalled by some escape sequence from normal HTML, <%...%> or whatever. This works nicely if the amount of HTML dwarfs the amount of program code. However, if you're following good software design principles, then as you develop the application you're also factoring out repeated code into functions. For example, instead of coding 10 rows of a table as HTML, you'll write a loop that repeats a single row 10 times. If you continually refactor in this way, the amount of HTML code shrinks, getting replaced by invocations of common functions and templates until the density of the escape sequence in the source code is unreadably high.

Without us ever realizing this while designing it, PTL turns the idea of HTML templating on its head: the default is program code, and there's an escape sequence into HTML. The escape sequence is just Python's notation for string literals, which is a compact and easily readable notation. Functions that actually contain HTML therefore can look a little messy, but most HTML only appears in the lowest-level functions; the bulk of our PTL code simply calls other PTL templates and inserts the odd "<br>" or "<table>" here and there.

Quixote is free software, and can be downloaded from the MEMS Exchange Web site. Three early releases have been made so far, and a fourth is in preparation. Previously we haven't been actively encouraging other developers to use Quixote, but with the upcoming release, this is likely to change. At this point we've used Quixote for several different tasks, giving us more confidence that the software is stable, and have written some user-friendly documentation for it. It should now be straightforward for other groups to start using Quixote for their own Web programming needs.

4.1 Related Links

http://www.mems-exchange.org/software/quixote/: The Quixote home page. The current release is version 0.3.

http://www.amk.ca/python/writing/why-not-zope.html: A more detailed (and controversial) explanation of why our team decided to stop using Zope and develop Quixote.

http://www.arsdigita.com/asj/application-servers: "Scalability, Three-Tiered Architectures, and Application Servers", by Philip Greenspun, looks at Netscape's Kiva application server and shows why interpreted languages are better choices for Web applications.

5 The Database: ZODB

The early versions of our site used a relational database (Solidtech's Solid SQL Server) to store the data. Over time it became increasingly clear that this was a bad design decision. As more runs were submitted, we discovered that Solid's query optimizer was weak and some of the queries turned out to scale poorly, taking more and more time to complete as more rows were added to the SQL tables. By the time the system was replaced, certain reports would take over 30 seconds to be generated!

We briefly considered switching to a different SQL database, thinking that perhaps Oracle or PostgreSQL had a smarter optimizer and would perform better, but it seemed clear that relational databases weren't very well suited for our problem. In our model, a process run is represented as a constellation of around a hundred object instances. Sets of nested objects map to multiple SQL tables, and reading in a single object can require several queries and joins to read all of its contents.

Automated object/relational mapping would have saved us from writing repetitive code, but we didn't think it would solve the problem of speed because the multiple SQL queries required to read or write a single object are unavoidably slow. The only obvious way to make it faster would be to cache objects in memory and skip the SQL queries when possible, but this seemed like a difficult job -- almost a step toward implementing our own database system, really -- and we were uneasy about debugging our code using irreplaceable customer data. Commercial object/relational mappers are available, but we didn't know of any Python-specific ones.

We therefore turned to considering object database systems, and ended up looking at three systems: ZODB, POET, and Versant. The Z Object Database, or ZODB, is part of Zope, and is a persistent object system for Python objects. POET and Versant are commercial object databases, both supporting Java and C/C++. We compared them by trying to implement a miniature version of the MEMS Exchange database schema. The ZODB implementation was used for a pure Python version, which was straightforward.

Surprisingly, the commercial products were much less flexible. Because the commercial databases didn't support Python, we wrote a set of toy Java classes, planning to glue them together using Jython. On trying to make our toy classes persistent using the commercial products, we discovered we'd have to make sizable changes to the class interfaces just to make them compatible with the product's limitations. Worse still, POET and Versant needed differing sets of changes because object databases aren't standardized very well. This made it clear that choosing a database would be a permanent decision; changing to another database would require pretty much a complete rewrite. The ZODB also requires some code changes to make an existing Python class support persistence, but crucially, that can be done without modifying the class interface too much. We therefore chose the ZODB.

The lowest layer of the ZODB is the Storage interface. Classes that implement Storage handle storing and retrieving objects from some form of long-term storage medium, usually a disk file. A few different types of Storage have been written, such as FileStorage which uses regular disk files and BerkeleyStorage which uses Sleepycat Software's BerkeleyDB database. We use FileStorage because it's the oldest and most tested storage supported by the ZODB, and it's been adequate for our needs so far.

ZEO, Zope Enterprise Objects, consists of about 1400 lines of Python code on top of the ZODB. The code for ZEO is relatively small because it contains only code for a ZEO server and for a new type of Storage, ClientStorage. ClientStorage doesn't use disk files at all; it simply makes remote procedure calls to a ZEO server which then passes them on to a regular Storage class such as FileStorage. Unlike FileStorage, which can only be opened by a single process, multiple processes can connect to the same ZEO server and update its contents simultaneously. We need to be able to have the live site up and running while performing other tasks such as running reporting scripts or modifying the database manually, so ZEO is absolutely necessary for our needs.
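The layering described here, a client-side storage that holds no data of its own and forwards every call to a server that owns the one real storage, can be sketched with a few toy classes. None of the names or method signatures below belong to the ZODB or ZEO APIs; they are hypothetical, and the "remote procedure call" is just an in-process method call standing in for a network round trip.

```python
import pickle


class MemoryStorage:
    """Toy storage keeping pickled records in a dict (stand-in for a file)."""

    def __init__(self):
        self._records = {}

    def store(self, oid, data):
        self._records[oid] = data

    def load(self, oid):
        return self._records[oid]


class ToyServer:
    """Stand-in for a storage server: it owns the single real storage."""

    def __init__(self, storage):
        self._storage = storage

    def handle(self, method, *args):
        # In a real deployment this request would arrive over a socket.
        return getattr(self._storage, method)(*args)


class ForwardingStorage:
    """Stand-in for a client-side storage: no files, only forwarded calls."""

    def __init__(self, server):
        self._server = server

    def store(self, oid, data):
        return self._server.handle("store", oid, data)

    def load(self, oid):
        return self._server.handle("load", oid)


server = ToyServer(MemoryStorage())
web_site = ForwardingStorage(server)          # e.g. the live site
report_script = ForwardingStorage(server)     # e.g. a reporting script

web_site.store(1, pickle.dumps({"owner": "akuchlin"}))
print(pickle.loads(report_script.load(1)))    # both clients see the same data
```

Because both clients implement the same small interface as the underlying storage, code written against the interface does not need to know whether it is talking to a local storage or to a server.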

To make a Python class persistent, it must inherit from the ZODB.Persistent class. We've written an MXBase class that provides a number of convenience methods, and an MXPersistent class that derives from both MXBase and ZODB.Persistent. Most of our classes derive from MXPersistent.

(MX is the acronym we use internally for ``MEMS Exchange.'' The MXBase and MXPersistent classes have no relationship to the mx Extensions for Python available from http://www.egenix.com. We use mxDateTime, part of the mx Extensions, throughout our system for representing dates and timestamps.)

MXBase provides a standard __repr__ method, a dump() method to show an object's contents, and a copy() method to perform a deep copy of an object respecting its specification of which attributes should be shared among copies. MXPersistent currently adds no additional methods of its own.
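A base class with these three conveniences might look roughly like the sketch below. It is not the MXBase source: the class name ToyBase, the shared_attrs class attribute used to declare which attributes are shared between copies, and the choice to have dump() return a string are all assumptions made for illustration.

```python
from copy import deepcopy


class ToyBase:
    # Hypothetical declaration: names of attributes that copies share
    # by reference instead of deep-copying.
    shared_attrs = ()

    def __repr__(self):
        return "<%s at %x>" % (self.__class__.__name__, id(self))

    def dump(self):
        """Return a multi-line listing of the instance's attributes."""
        lines = [repr(self)]
        for name, value in sorted(vars(self).items()):
            lines.append("  %s = %r" % (name, value))
        return "\n".join(lines)

    def copy(self):
        """Deep-copy every attribute except those named in shared_attrs."""
        new = self.__class__.__new__(self.__class__)
        for name, value in vars(self).items():
            if name in self.shared_attrs:
                setattr(new, name, value)            # shared reference
            else:
                setattr(new, name, deepcopy(value))  # independent copy
        return new


class Step(ToyBase):
    shared_attrs = ("process",)

    def __init__(self, process, parameters):
        self.process = process
        self.parameters = parameters


catalog_entry = {"name": "Silicon nitride PECVD"}
original = Step(catalog_entry, {"thickness": 0.5})
duplicate = original.copy()
duplicate.parameters["thickness"] = 0.7   # does not affect the original
print(original.dump())
```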

The ZODB uses reachability to decide which objects are to be stored persistently. It starts from a single root object, a dictionary-like object whose contents are given object IDs and stored in the database. The attributes and/or contents of these objects are also stored, and so on recursively. The database essentially contains a directed graph consisting of all the objects reachable from the root. How do we impose a structure upon this graph?

Inside our ZODB there are a number of different singleton objects that are responsible for managing a collection of objects, and are generally named XYZDatabase. For example, an instance of the RunDatabase class contains all the runs and run-related objects in the MEMS Exchange system, a UserDatabase instance holds all the User and Group objects, and so forth. These database objects generally have methods for adding newly created objects, getting a list of all the contained objects, and for retrieving a particular object based on its unique ID. As an example, the skeletal definition of the UserDatabase class is:

class UserDatabase(MXPersistent):
    """Class to hold all users and groups in the system.  User and group
    IDs are always looked up in the user database, so you will generally
    not be able to use a user or group until they have been added to the
    user database.  Normally, there is only one UserDatabase instance,
    and you don't create it directly -- instead, use the
    'get_user_database()' function provided by mxtoy.virtual_fab to
    create or find the UserDatabase instance.

    Instance attributes:
      users : BTree { user_id:string : user:User }
        all known users
      groups : BTree { group_id:string : group:Group }
        all known groups

    """

    def clear (self): ...

    def get_user (self, user_id):
        """Return the User object with id 'user_id', or None if no such
        user."""

    def get_group (self, group_id):
        """Return the Group object with id 'group_id', or None if no
        such group."""

    def add_user (self, user):
        """Add User object 'user' to the user database.  Raises
        MXConflictError if there is already a user with the same user
        ID.
        """

    def delete_user(self, user):
        """delete_user( user : User)

        Delete User object from the database.
        """

    def add_group (self, group):
        """Add Group object 'group' to the user database.  Raises
        MXConflictError if there is already a group with the same group
        ID."""

    def get_users (self): ...
    def get_groups (self): ...
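To make the skeleton concrete, here is a minimal runnable sketch of the user-handling half of such a class. It is an illustration only: a plain dict stands in for the BTree, the class does not inherit from MXPersistent, the local MXConflictError is a stand-in for the real exception, and the user_id attribute on User is an assumption. Group handling would follow the same pattern.

```python
class MXConflictError(Exception):
    """Local stand-in for the exception named in the docstrings above."""


class User:
    def __init__(self, user_id):
        self.user_id = user_id   # assumed attribute name


class ToyUserDatabase:
    def __init__(self):
        self.users = {}          # plain dict standing in for the BTree

    def get_user(self, user_id):
        """Return the User with id 'user_id', or None if no such user."""
        return self.users.get(user_id)

    def add_user(self, user):
        """Add 'user'; raise MXConflictError on a duplicate user ID."""
        if user.user_id in self.users:
            raise MXConflictError("duplicate user ID: %r" % user.user_id)
        self.users[user.user_id] = user

    def delete_user(self, user):
        del self.users[user.user_id]

    def get_users(self):
        return list(self.users.values())


user_db = ToyUserDatabase()
user_db.add_user(User("akuchlin"))
print(user_db.get_user("akuchlin").user_id)
print(user_db.get_user("nobody"))
```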

A common complaint about object databases is that they often don't support a query language. The ZODB doesn't have a query language, but that isn't really a problem because it's easy and quick to write Python code implementing a particular query. If this was a Java-based system, then writing Java code to implement queries would be annoying and tedious, but Python code is quite compact and can often be composed directly at an interpreter prompt. We've written a script called opendb that connects to our ZEO server and creates variables for all of the root databases, as well as a few convenience functions and variables.

For example, the following transcript shows how we can change the owner of a run. This task occasionally comes up, but not so often that it's worth building a Web interface for doing it.

ludwig akuchlin>opendb
root databases available:
  template_lib
  process_lib
  run_db
  user_db
  business_db
  results_db
  session_manager
  shared_cache
 
other variables and functions:
  database
  connection
  root
  commit() = get_transaction().commit()
  abort()  = get_transaction().abort()
  sync()   = connection.sync()

>>> r = run_db.get_run(113)
>>> r.owner
<User at 839e2f8: akuchlin>
>>> r.owner = user_db.get_user('gward')
>>> r.owner
<User at 83a0348: gward>
>>> commit()        # commit the transaction

Here's a slightly more complicated query to find all runs which use a particular process:

>>> for r in run_db.get_runs():
...     for i in range(len(r.sequence)):
...         step = r.sequence[i]
...         if step.get_process_id() == 'CNF-B-0019':
...             print r, 'step', i
...
R0172V6 step 0
R0287V25 step 14
R0331V14 step 0
R0331V14 step 1
R0376V5 step 1
R0419V2 step 1
R0481V4 step 0

A true declarative query language would make it possible to optimize queries by automatically using indexes or by implementing a more sophisticated traversal of the universe of objects. When using opendb, a developer will normally do a straightforward implementation, scanning through all the objects, resulting in O(n) or O(n^2) performance. This isn't much of a disadvantage for us, though, because we so rarely use queries; traditional SQL-ish queries such as ``Find me all the users in California who have more than 5 runs'' rarely come up in practice for us.
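For comparison, the SQL-ish query quoted above could be written as a straightforward scan in Python. The sketch below uses toy data and hypothetical attribute names (user_id, state, owner); it does not reflect the real User or Run classes. Counting runs per owner first keeps the work to one pass over the runs and one over the users, rather than a nested loop.

```python
from collections import Counter, namedtuple

# Toy records with hypothetical attribute names.
User = namedtuple("User", "user_id state")
Run = namedtuple("Run", "run_id owner")

alice = User("alice", "CA")
bob = User("bob", "CA")
carol = User("carol", "VA")
users = [alice, bob, carol]
runs = ([Run(i, alice) for i in range(6)]           # six runs for alice
        + [Run(10, bob)]                            # one run for bob
        + [Run(20 + i, carol) for i in range(7)])   # seven runs for carol


def busy_users(users, runs, state, min_runs):
    """Return users in 'state' owning more than 'min_runs' runs."""
    counts = Counter(run.owner.user_id for run in runs)   # one pass over runs
    return [u for u in users
            if u.state == state and counts[u.user_id] > min_runs]


print(busy_users(users, runs, "CA", 5))
```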

We've been quite pleased with how the ZODB has worked for us. It's been reliable and easy to develop with, and we'd like to encourage other groups to try it. In pursuit of that goal, I've packaged the ZODB code as a standalone package, and begun writing a ZODB programmer's guide.

5.1 Related Links

http://www.amk.ca/zodb/: My packaged version of the ZODB as a standalone distribution.

http://www.amk.ca/zodb/guide/: The current draft of a programmer's guide to the ZODB.

6 The Type Checker: Grouch

In a highly dynamic language such as Python, it's possible that a typo can result in code that executes without an error but silently produces incorrect results. The most common example is a typo in an attribute assignment.

class C:
    def __init__(self):
        self.template = None

    def method(self):
        # Note typo in the next line
        self.templtae  = ParameterTemplate(...)

No error is reported when method() is called, but it assigns a value to an incorrect attribute, and the template attribute ends up always being None.

We tackle this problem in two ways: through coding conventions and through a software tool. First, let's look at the coding conventions. The convention is that attributes can be accessed directly, but they should always be set using a modifier method. In the above example, we would write a set_template() method:

from mems.lib.typeutils import typecheck_inst
class C:
    def set_template(self, template):
        typecheck_inst("template", template, 
                       ParameterTemplate)
        self.template = template

A typo in an attribute access (e.g. obj.templtae) will cause Python to raise an AttributeError exception if the attribute doesn't exist. A typo in the name of a modifier method (e.g. obj.set_templtae(t)) will also raise an AttributeError, so we just have to be sure that there aren't any typos in the definition of the set_template() method. Such methods are usually small and easily checked by inspection, and writing a two-line unit test for the method makes it impossible to get it wrong.

Another, more subtle, potential problem is assigning an incorrect type to an attribute. In the above example, the template attribute should be either None or a ParameterTemplate instance, but Python provides no way to enforce this; you could pass an instance of a different class, or even any other Python object such as a file object or a string. When an object is being stored persistently, such a type error on an attribute might only be discovered long after its introduction, when other software accesses the attribute and finds it to be of an incorrect type. Finding the cause of the incorrect assignment would then be difficult, because the bug could be hiding anywhere in our system.

We do two things to alleviate this problem. The first and simplest preventative measure is to check types before assigning to an attribute. Because we use modifier methods, the type check can be done in exactly one place, so we don't need to scatter type checking code throughout the code. The typecheck_inst() function in the above example is a utility function in our standard library that checks if an instance is of the specified class, and raises a TypeError if the instance is of the wrong class.
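A helper of this kind needs only a few lines. The version below is a guess at the behaviour described, not the code from mems.lib.typeutils; the argument order follows the call shown earlier, and the empty ParameterTemplate class is a placeholder so the example runs on its own.

```python
def typecheck_inst(name, value, klass):
    """Raise TypeError unless 'value' is an instance of 'klass'."""
    if not isinstance(value, klass):
        raise TypeError("%s: expected an instance of %s, got %r"
                        % (name, klass.__name__, value))


class ParameterTemplate:
    """Placeholder for the real class."""


class C:
    def __init__(self):
        self.template = None

    def set_template(self, template):
        # The single place where the attribute's type is enforced.
        typecheck_inst("template", template, ParameterTemplate)
        self.template = template


obj = C()
obj.set_template(ParameterTemplate())      # accepted

try:
    obj.set_template("not a template")     # wrong type: rejected at once
except TypeError as exc:
    print("rejected:", exc)

try:
    obj.set_templtae(ParameterTemplate())  # typo in the method name
except AttributeError as exc:
    print("typo caught:", exc)
```

Both mistakes surface immediately at the call site, instead of leaving a bad value to be discovered later in the database.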

Explicitly coded typechecking tries to prevent errors from being made, but we also want to detect errors that slip through, in case someone forgets to code an explicit check. It could also happen if the incorrect attribute assignment was performed by code in Python's standard library or a third-party extension that we use, such as the ZODB or the PyXML package. An incorrect assignment might happen because the external module has a bug, or, more likely, if the module is being used incorrectly by our code.

To check for this, Greg Ward wrote an after-the-fact typechecking tool called Grouch. (Originally it was called Oscar, but there are already several other pieces of software with that name. The name was chosen because the software will grouch at you when it notices a problem.)

Grouch works by parsing sets of type declarations embedded in class docstrings. The declarations look like this:

class PhysicalValue (MXBase):
    """A physical value: a number (or range value) tied to a physical
    unit that automatically performs unit conversion when arithmetic
    operations are performed.

    Instance attributes:
      value : float | RangeValue
        the numeric part of the physical value
      unit : PhysicalUnit
        the unit part (may be None for inherently unitless values, in
        which case the instance just acts like a number)
    """

The type declaration language is straightforward. In the above example, the value attribute must be either a Python float or an instance of the RangeValue class, and unit must be an instance of the PhysicalUnit class. A script called gen_schema.py crawls through a set of directories, parses all the Python files it finds, examines the class docstrings for declarations, and builds an object schema, writing it out to a file. A second script, check_db.py, loads a previously generated schema and traverses the graph of objects in our database, checking that each instance has all the expected attributes, that all the attribute values are of the right type, and that there are no extra attributes present that aren't specified in the schema.
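
To make the two steps concrete, here is a heavily simplified sketch of the same idea. It is not Grouch's implementation: parse_schema() and check_instance() are hypothetical names, the parser only understands "name : type | type" lines, and type names are resolved through a dictionary supplied by the caller.

```python
import re

ATTR_LINE = re.compile(r"^\s*(\w+)\s*:\s*(.+?)\s*$")


def parse_schema(cls, known_types):
    """Return {attribute name: tuple of allowed types} from cls's docstring."""
    schema = {}
    in_section = False
    for line in (cls.__doc__ or "").splitlines():
        if line.strip() == "Instance attributes:":
            in_section = True
            continue
        match = ATTR_LINE.match(line) if in_section else None
        if match:
            names = [n.strip() for n in match.group(2).split("|")]
            # Description lines are skipped: they do not name known types.
            if all(n in known_types for n in names):
                schema[match.group(1)] = tuple(known_types[n] for n in names)
    return schema


def check_instance(obj, schema):
    """Return a list of problems: missing, mistyped, or undeclared attributes."""
    problems = []
    attrs = vars(obj)
    for name, types in schema.items():
        if name not in attrs:
            problems.append("missing attribute %r" % name)
        elif not isinstance(attrs[name], types):
            problems.append(
                "%r has type %s" % (name, type(attrs[name]).__name__))
    for name in attrs:
        if name not in schema:
            problems.append("undeclared attribute %r" % name)
    return problems


class RangeValue:
    """Placeholder class."""


class PhysicalUnit:
    """Placeholder class."""


class PhysicalValue:
    """A number tied to a physical unit.

    Instance attributes:
      value : float | RangeValue
        the numeric part of the physical value
      unit : PhysicalUnit
        the unit part
    """

    def __init__(self, value, unit):
        self.value = value
        self.unit = unit
```

A real checker would walk every object reachable from the database root and apply a check like check_instance() to each one; this sketch checks a single instance only.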

We periodically run Grouch on our live site, in hopes of catching errors as soon as possible if they occur, and also on our staging server, in order to catch bugs in the development version of the code before it gets deployed on our live site. Grouch has caught bugs on our staging server, and so far hasn't detected any bugs on the live site, so presumably we've managed to fix all our bugs before deploying code (or any remaining bugs are very subtle, and haven't been triggered yet).

6.1 Related Links

http://www.mems-exchange.org/software/grouch/: The Grouch home page. Grouch is covered by a Python-style license, so it qualifies as free software.

7 The Test Suite: unittest

We also wrote a unit testing framework, mems.tools.unittest. Greg Ward wrote the original version, and its design was inspired by Kent Beck's Smalltalk testing framework. Neil Schemenauer added code coverage features using a modified version of Skip Montanaro's trace.py.

Here's an example of unittest in use. Consider a simple Python function f(s, val) that takes a string s and repeats it val times, but reports an error if val is negative. Here's the code for f():

 
def f(s, val):
    if val < 0:
        raise ValueError, 'val cannot be negative'
 
    return s * val

The test suite for this function might be written as follows:

 
from quixote.test.unittest import TestScenario, parse_args, run_scenarios

import module
 
tested_modules = ['module']
class MyFunctionTest (TestScenario):
 
    def setup(self): pass
 
    def shutdown(self): pass
 
    def check_val_param(self):
        "Test error checking for the val parameter: 3"         
        # Negative number should raise a ValueError
        self.test_exc( "module.f('', -1)", ValueError)
 
        # Zero is OK
        self.test_stmt( "module.f('', 0)")
 
        # Positive numbers are also OK
        self.test_stmt( "module.f('', 1)")
 
    def check_func(self):
        "Test the function's output: 6"
 
        # Test the null case (val == 0)
        self.test_val( "module.f('', 0)", '')
        self.test_val( "module.f('abc', 0)", '')
 
        # Test the identity (val == 1)
        self.test_val( "module.f('', 1)", '')
        self.test_val( "module.f('abc', 1)", 'abc')
 
        # Test a real case (val == 3)
        self.test_val( "module.f('', 3)", '')
        self.test_val( "module.f('abc', 3)", 'abcabcabc')
 
if __name__ == "__main__":
    (scenarios, options) = parse_args()
    run_scenarios (scenarios, options)

When run, this test case prints:

 
kronos /tmp>python test.py
MyFunctionTest:
  ok: Test the function's output ('func') (6 tests passed)
  ok: Test error checking for the val parameter ('val_param') (3 tests passed)
ok: 9 tests passed
kronos /tmp>
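
For readers who don't have our framework installed, roughly the same nine checks can be written with the unittest module in Python's standard library (PyUnit). This is a different framework from the one described here, so the class layout and the run_tests() helper below are illustrative only; f() is written in current Python syntax.

```python
import unittest


def f(s, val):
    if val < 0:
        raise ValueError('val cannot be negative')
    return s * val


class FunctionTest(unittest.TestCase):

    def test_val_param(self):
        # Negative number should raise a ValueError
        with self.assertRaises(ValueError):
            f('', -1)
        # Zero and positive numbers are OK
        f('', 0)
        f('', 1)

    def test_func(self):
        # Null case (val == 0)
        self.assertEqual(f('', 0), '')
        self.assertEqual(f('abc', 0), '')
        # Identity (val == 1)
        self.assertEqual(f('', 1), '')
        self.assertEqual(f('abc', 1), 'abc')
        # A real case (val == 3)
        self.assertEqual(f('', 3), '')
        self.assertEqual(f('abc', 3), 'abcabcabc')


def run_tests():
    """Run FunctionTest and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FunctionTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Note that the standard library counts test methods rather than individual checks, so it reports two tests here where our framework reports nine.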

The unit test framework also supports measuring code coverage. The default argument parsing lets you add the -c switch to turn on code coverage, and adding -v produces a listing of the tested module highlighting lines that weren't executed. For example:

 
kronos /tmp>python test.py -c
MyFunctionTest:
  ok: Test the function's output ('func') (6 tests passed)
  ok: Test error checking for the val parameter ('val_param') (3 tests passed)
ok: 9 tests passed
code coverage:
  module: 100.0% (4/4)
kronos /tmp>python test.py -c -v
  ... additional output while running the tests deleted ...
ok: 9 tests passed
code coverage:
  module:
      .
    10: def f(s, val):
      9:    if val < 0:
      1:        raise ValueError, 'val cannot be negative'
      .
      8:    return s * val
  100.0% (4/4)

The number at left is the number of times each line was executed.

If you add an elif val == 42 branch to the 'if' statement, its block will never be executed by the tests, so the code coverage reports the unexecuted lines:

 
code coverage:
  module:
      .
    10: def f(s, val):
      9:    if val < 0:
      1:        raise ValueError, 'val cannot be negative'
      8:    elif val == 42:
  >>>>>>        print 'The answer!'
      .
      8:    return s * val
  83.3% (5/6)
kronos /tmp>

Armed with this information, you can now go back and add a test that will exercise the branch for val==42.
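
The underlying mechanism can be sketched with sys.settrace, the standard hook that tracing tools build on. The following is a minimal illustration, not the code used by our framework: count_lines() is a hypothetical helper that counts how often each line of one function runs, and f() is a variant of the example function with the val == 42 branch added.

```python
import sys
from collections import Counter


def count_lines(func, calls):
    """Call func(*args) for each tuple in calls and count executed lines.

    Returns a Counter mapping line offsets within func (0 is the 'def'
    line) to the number of times that line ran.
    """
    code = func.__code__
    counts = Counter()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None          # don't trace any other function
        if event == "line":
            counts[frame.f_lineno - code.co_firstlineno] += 1
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        for args in calls:
            try:
                func(*args)
            except Exception:
                pass             # an exception still counts as coverage
    finally:
        sys.settrace(previous)
    return counts


def f(s, val):
    if val < 0:
        raise ValueError("val cannot be negative")
    elif val == 42:
        return "The answer!"
    return s * val


counts = count_lines(f, [("", -1), ("abc", 0), ("abc", 3)])
for offset in range(1, 6):
    marker = ">>>>>>" if offset not in counts else "%6d" % counts[offset]
    print(marker, "line offset", offset)
```

The line at offset 4, the body of the val == 42 branch, is the only one flagged as unexecuted, because none of the three calls passes 42.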

A separate script, run_tests, can run a single test by name or search for subdirectories named "test" and run all the tests they contain. This is useful for testing an entire source tree:

kronos proto3>~/src/mems/tools/run_tests.py -r .
looking for test scripts...found 45
ok: lib/test/test_pvalue.py: 118 tests passed
ok: lib/test/test_range.py: 129 tests passed
ok: lib/test/test_unit.py: 109 tests passed
...
ok: template/test/test_eqtemplate.py: 14 tests passed
ok: prc/test/test_inter.py: 0 tests passed
ok: 2595 tests passed
kronos proto3>
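
The search step can be sketched in a few lines. This is not the actual run_tests code: find_test_scripts() is a hypothetical function, and the test_*.py naming rule is an assumption inferred from the file names in the listing above.

```python
import os


def find_test_scripts(root):
    """Return sorted paths of test_*.py files in directories named 'test'."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) != "test":
            continue
        for name in filenames:
            if name.startswith("test_") and name.endswith(".py"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


if __name__ == "__main__":
    scripts = find_test_scripts(".")
    print("looking for test scripts...found %d" % len(scripts))
```

Each script found this way would then be run in turn and its pass counts added to the total.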

We can also run the tests to measure code coverage, so we can find out which modules aren't being tested very completely.

Code coverage:

  oscar.valuetype: 36.6% (168/459)
  mems.run.process_run: 43.1% (202/469)
  mems.tools.unittest: 45.4% (237/522)
  mems.lib.base: 46.9% (75/160)
  oscar.util: 48.1% (26/54)
  mems.process.process_module: 52.4% (100/191)
  ...

The ultimate in conciseness is achieved by adding the "-q" switch:

kronos proto3>python tools/run_tests.py -rq
ok: 2595 tests passed

We run this last invocation every night and mail the output to the development team. This lets us catch changes that accidentally break code within 24 hours of the errant change being checked in. This is much better than discovering bugs weeks later when the software has been deployed on our live site.

7.1 Related Links

unittest is included with the Quixote distribution, available from http://www.mems-exchange.org/software/quixote/.

http://www.xprogramming.com/testfram.htm: Kent Beck's original paper describing a Smalltalk unit testing strategy and framework.

http://pyunit.sourceforge.net/: Another Python unit testing framework, written by Stephen Purcell and now part of the Python standard library.

8 The Interface: Web UI Design

The final problem is that of building a coherent Web-based user interface. One of the problems with the old system was that it had been allowed to grow at random, with no single unifying style or metaphor. We wound up with a bunch of disjointed servlets, each performing a single task, but customers were never clear on how many different forms there were or how to find them. MEMS Exchange staff who had to use the software every day would eventually learn the various servlets and their specialized functions, but customers only use a Web site intermittently and can't be expected to learn its eccentricities.

We unified things by stealing the idea of a personalized page from Web portals such as Yahoo or Netscape's Netcenter. Everyone, whether customer, fab employee, or MEMS Exchange staff member, has a personalized page accessible as http://fab.mems-exchange.org/my/ that tracks all the process runs they need to worry about. Customers have a very simple page, listing the current status of all their process runs along with useful links to the process catalog, Run Builder, and a few other tools. Fabs have a different set of useful links and a slightly more complicated page, listing the runs they're currently working on and runs they're supposed to review. MEMS Exchange staff have the most complicated personalized page of all, listing runs in all stages of processing and with forms for viewing a user's registration information, a particular account, and other administrative functionality.

We've tried to follow good software development principles by factoring out common code as much as possible. A pleasant side effect of this is that it's easier to change the look of the site. For example, we need to display a user's identification in lots of different places, and this is neatly wrapped up by a format_username template. When we decided to make the user's e-mail address a mailto link, only format_username needed to be changed. Similarly, the display of process runs has been revised over the past year, and changing the basic templates modified both the Run Builder and the Run Tracker. Re-using the same basic set of templates ensures that the site's display remains consistent no matter how many different sub-applications there are.

URLs should also be considered as part of a Web site's user interface, meaning that they should be designed with some care, because users have become accustomed to editing the URL when they find themselves too deep within a site. The first version of the site had ugly URLs. For example, the URL for displaying a particular run was something like "http://.../ProcessRun?RUN_ID=500". We wanted to make a neater URL, "http://.../run/500/", possible, and with the addition of a single feature to Quixote, that proved fairly simple.

As Quixote traverses the package containing the user interface (mems.ui, in our case) it looks for a magic function or method named "_q_getname(request, component)". request is the HTTPRequest object representing the current HTTP transaction, and component is the next component of the URL path. To display runs, we have the following _q_getname():

def _q_getname (request, component):
    return RunUI(request, component)

Accessing the URL path "/run/500/" will cause this to be invoked with component set to "500". The constructor for the RunUI class then converts the component to an integer, does a few access control checks (omitted here for simplicity), and retrieves the corresponding run.

class RunUI:
    _q_exports = ['details', 'check', ...]

    def __init__ (self, request, component):
        run_id = int(component)
        run_db = get_run_database()
        self.run = run_db.get_run(run_id, run_version) 

    def _q_index (self, request):
        ... return index page ...

    def details (self, request):
        ... return a more detailed page ...

After Quixote has called _q_getname() to retrieve an object, it then continues traversing the object, so we can define any number of different methods on the RunUI instance. "/run/500/" will execute RunUI._q_index() and return its output, while "/run/500/details" will run RunUI.details(), which produces a more detailed display of a process run. This has become a common pattern in our interface: create a *UI class surrounding an object and then define methods for viewing the object or its revision history, editing it, changing its status, or whatever. Each method can implement its own access control rules, so some methods can be limited to MEMS Exchange staff or to fabs, and the code remains simple to read and audit.
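
The traversal described above can be imitated with a toy dispatcher. The traverse() function below is a deliberately simplified stand-in and is not Quixote's real algorithm; Run, RUNS, RunNamespace, and Root are hypothetical names, and access control and error pages are left out.

```python
class Run:
    def __init__(self, run_id):
        self.run_id = run_id


RUNS = {500: Run(500)}           # stand-in for the run database


class RunUI:
    _q_exports = ["details"]

    def __init__(self, request, component):
        self.run = RUNS[int(component)]

    def _q_index(self, request):
        return "index page for run %d" % self.run.run_id

    def details(self, request):
        return "details for run %d" % self.run.run_id


class RunNamespace:
    """Plays the role of the 'run' module inside the UI package."""

    def _q_getname(self, request, component):
        return RunUI(request, component)


class Root:
    _q_exports = ["run"]
    run = RunNamespace()


def traverse(root, path, request=None):
    """Resolve a URL path one component at a time (toy version)."""
    obj = root
    for component in [c for c in path.split("/") if c]:
        if component in getattr(obj, "_q_exports", []):
            obj = getattr(obj, component)        # explicitly exported name
        elif hasattr(obj, "_q_getname"):
            obj = obj._q_getname(request, component)
        else:
            raise LookupError("no such page: %s" % path)
    if hasattr(obj, "_q_index"):
        return obj._q_index(request)
    if callable(obj):
        return obj(request)
    raise LookupError("no such page: %s" % path)


print(traverse(Root(), "/run/500/"))
print(traverse(Root(), "/run/500/details"))
```

The first call ends at a RunUI instance and falls back to its _q_index() method; the second finds "details" in _q_exports and calls that method instead.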

8.1 Related Links

http://www.useit.com/alertbox/990321.html: Jakob Nielsen's AlertBox column on URLs, pointing out that URLs should be considered part of the site's interface and therefore need to be designed with care.

9 The Process: Agile Programming

In January 2000, I read Extreme Programming Explained by Kent Beck and was struck by how intuitively reasonable the described methodology seemed. Fired with enthusiasm, I encouraged all the other developers to read the book, and there was general agreement that we should adopt some aspects of the methodology described.

Extreme Programming and similar lightweight development methodologies can be grouped together under the term "agile programming". The basic idea of agile programming is to lighten the burden of the development process through making frequent small releases, reducing administrative overhead to a minimum, and constantly simplifying and tidying the code. Agile programming isn't suited for programming a system where the specification can be written out completely correctly in advance, but we certainly aren't working in an application domain where that's possible; new ideas and features are continually being suggested.

Agile programming consists of several related practices. We don't follow every single practice, but instead have adopted a subset that has proven quite productive and helpful. The practices we follow are:

Test first:
Always do extensive unit testing of code. Testing isn't left until after code is written because that makes it very easy to procrastinate and avoid writing the tests at all. A recommended practice is to write the tests first, even before writing any lines of code. This means that you have to figure out the class API in order to write the tests, and once the tests run to completion, you can be fairly sure that the class's implementation is correct, at least as far as the tests exercise it. As explained in the earlier section on quixote.test.unittest, we have many tests for our basic objects, and we run all the tests nightly and mail the output to the developers.

Refactor mercilessly:
Another practice is to avoid deferring code cleanups and simplification. Instead of just hacking around a minor flaw and pressing onward, as soon as we notice the need to refactor code it's done immediately. At times this can be irritating, as a small fix turns up the need for a more elaborate refactoring which sidetracks you for a time, but it also prevents the code from slowly growing into a tissue of hacks because small problems are fixed early before they've had time to grow and entangle more of the code base.

Lack of code ownership:
Another principle is that anyone can modify any code in the system; components aren't owned by a single person who has the sole right to make changes. Most components have a primary author who wrote most of the initial code and is most familiar with it, but most subsystems have been modified by the other team members at some point. This practice prevents the existence of large swaths of code that are only understood by one person. It's noteworthy that the remote microscope system, the subsystem most closely tied to an individual author, is also the biggest and most tangled hairball. (Fixing the remote microscope will require essentially discarding the current code and designing a new implementation from the ground up, applying our current code standards. We've now started working on this.)

The most significant practice of extreme programming we don't follow is pair programming, the practice of always having two developers write code together. In our group, pairs of developers will certainly collaborate on complicated classes or refactorings when we feel the need for backup, but most of the time we work individually.

Three of our developers (A.M. Kuchling, Neil Schemenauer, and Greg Ward) have also worked on various free software projects, Python itself being the one project we all have in common. In my experience, successful free software projects such as Python often have adopted some agile development practices. For example, all checkins to the Python source tree are sent to the python-checkins mailing list, where the other developers can read through the patch and offer comments and criticisms. We've adopted similar practices internally at the MEMS Exchange; we use CVS for version control, have a mailing list for project discussion and another list for checkins, and use a bug tracking system. When one developer moved out of our area, we became a geographically distributed development group, but our practices needed relatively little adaptation to handle the change.

10 Lessons Learned

One of the advantages of rebuilding a system is that you get the chance to fix shortcomings in the first version. This section will describe some of the most annoying problems in the first version of the system and how we fixed them. These problems aren't specific to our application domain, though, and are probably relevant to any sizable Web-based service. Our tools provide enough flexibility to deal with these problems in a straightforward way.

10.1 Persistent Sessions

One annoyance in the old system was that user sessions were stored in RAM, so they'd be lost if the Java or Zope process was restarted. Restarting could happen accidentally, if the process crashed and was automatically restarted by Apache, or deliberately, if we needed to install a bugfix. This was an uncomfortable situation, because if we discovered a bug in the middle of the working day, we couldn't deploy the corrected code for hours no matter how quickly we fixed the problem. Restarting the server would delete all the sessions and annoy everyone who was in the middle of using the site, so we'd have to wait until a quiet period in the evening or on a weekend.

We used Quixote and the ZODB to fix this, setting a randomly generated session cookie on Web browsers that access our site and using the cookie's value to look up a corresponding Session object. Session objects are also stored persistently in our ZODB and committed at the end of every HTTP transaction, meaning that we can restart the Quixote server at any time to load updated code, and we can reboot the machine without losing the collection of user sessions. When Quixote is killed, Apache will automatically restart it. All the user sessions are intact, and the only visible sign of the restart is that the next few page accesses will be unusually slow because all our Python modules and PTL pages have to be re-imported. (Because the FileStorage we use as the base of our ZODB is fully versioning, this means that it grows continually over the course of the day, but a nightly script backs up the database and then discards outdated versions of objects, so the file doesn't grow infinitely.)
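
The essential idea, a random cookie value used as the key to a session record that outlives the server process, can be sketched without the ZODB. In the sketch below a pickle file stands in for the database, so it has none of the ZODB's transactional guarantees; SessionStore and its methods are hypothetical names.

```python
import os
import pickle
import secrets


class SessionStore:
    """Sessions keyed by cookie value and written to disk on every commit."""

    def __init__(self, path):
        self.path = path
        self.sessions = {}
        if os.path.exists(path):
            # A restarted server picks up the sessions saved earlier.
            with open(path, "rb") as fp:
                self.sessions = pickle.load(fp)

    def commit(self):
        with open(self.path, "wb") as fp:
            pickle.dump(self.sessions, fp)

    def new_session(self):
        cookie = secrets.token_hex(16)   # random, hard-to-guess cookie value
        self.sessions[cookie] = {"user": None}
        self.commit()
        return cookie

    def get(self, cookie):
        return self.sessions.get(cookie)
```

Creating a second SessionStore on the same path simulates a server restart: a cookie issued before the restart still maps to the same session data afterwards.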

10.2 Acting as Other Users

A common task with the first implementation was to look up a user's password in order to log in as that user. We did this to verify reported bugs and to perform tasks that users requested via phone, but having to do this was always irritating. In the new system, we wrote a fancier user login system and added the feature of letting MEMS Exchange staff act as other users.

The login machinery is straightforward. Session objects have two attributes, user and actual_user, which are set to either None or a User instance. Most permission tests only look at the user attribute, but logging in sets both user and actual_user. Acting as another user is allowed by a form that checks whether actual_user is a MEMS Exchange staff member; if so, the form allows them to set user to a different User object, causing most code to behave as if the selected user were acting. This only takes a few lines of code, and lets MEMS Exchange staff behave as customers or as fab staff.
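
A minimal sketch of this arrangement follows. The User and Session classes here are simplified stand-ins, and the is_staff flag and the login() and act_as() method names are assumptions made for illustration.

```python
class User:
    def __init__(self, name, is_staff=False):
        self.name = name
        self.is_staff = is_staff


class Session:
    def __init__(self):
        self.user = None          # whom most permission tests look at
        self.actual_user = None   # who really logged in

    def login(self, user):
        self.user = self.actual_user = user

    def act_as(self, other):
        # The check uses actual_user, so it still works while acting.
        if self.actual_user is None or not self.actual_user.is_staff:
            raise PermissionError("only staff may act as another user")
        self.user = other
```

Because the check looks at actual_user and not user, a staff member who is already acting as a customer can switch to yet another user or back to themselves, while a customer can never use the form at all.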

11 Future Work

Having described our existing system, I'll now look at my best guess of what the future holds.

We will continue to develop and enhance Quixote as we need to, but right now we don't foresee any significant changes that need to be made. At this point the most uncertain area is changing our object schema to meet future requirements, because we're not sure how easy or hard it will be to change the schema using the ZODB. When first developing the system we had the luxury of simply deleting our working databases and recreating them from scratch whenever we refactored a set of classes, but once we went into production, that was no longer possible. Therefore we now have to figure out how to upgrade the database on the fly.

The ZODB lets you define a __setstate__ method on a class. When a class defines a __setstate__ method, the method will be called whenever a class instance is unserialized off the disk. __setstate__() is passed the instance dictionary for the old instance, so the method can examine its contents and add, remove, or modify attributes as desired. For example, here's an implementation that adds two attributes, description and status:

def __setstate__ (self, dict):
    self.__dict__.update(dict)

    # Added 'description' attribute, 2001/06/05
    if not hasattr(self, 'description'):
        self.description = None

    # Added 'status' attribute, 2001/06/27
    if not hasattr(self, 'status'):
        self.status = None

__setstate__ provides a way to deal with changes to a single class's attributes, but there's no mechanism for handling changes to a set of classes. For example, if we decided that the ProcessRun and ProcessSequence classes should be merged, __setstate__ isn't sufficient to do that. So far we've handled this by writing one-shot conversion scripts that traverse all the relevant objects, upgrade them to the new implementation, and commit the resulting changes. This is complicated to do, and may turn out to be a future source of errors (though it hasn't been, so far).
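
The single-class case can be tried outside the ZODB, because the standard pickle module also calls __setstate__ with the stored instance dictionary when a class defines it. The sketch below uses a simplified, hypothetical ProcessRun class and simulates an instance that was saved before the description and status attributes existed.

```python
import pickle


class ProcessRun:
    def __init__(self, name):
        self.name = name
        self.description = None
        self.status = None

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Fill in attributes that older saved instances lack.
        if not hasattr(self, "description"):
            self.description = None
        if not hasattr(self, "status"):
            self.status = None


# Build an "old" instance without running __init__, so that it carries
# only the attribute that existed in the earlier version of the class.
old = ProcessRun.__new__(ProcessRun)
old.__dict__["name"] = "run 500"
data = pickle.dumps(old)

restored = pickle.loads(data)     # __setstate__ runs here
print(sorted(vars(restored)))
```

After loading, the restored instance has all three attributes even though only name was stored, so code written against the new class definition can use it safely.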

Another problem is keeping the graph of objects inside the ZODB sane. It's quite easy to accidentally create copies of objects by not subclassing MXPersistent or by copying instances too enthusiastically, and this can be hard to detect until much later. For example, ProcessRun instances have an owner attribute that's an instance of the User class. A while ago we discovered that when a new version of a ProcessRun was created, the User object would be copied; this copy would only be accessible from the ProcessRun and wouldn't be updated if the user in question modified their registration details. Greg Ward has written a zodb_index.py script for tracing references, making it easier to debug such problems, but detecting them in the first place is still difficult. Do we need to write a graph checker? What would such a thing look like?

We also haven't tested the scalability of our service much. Current measurements show that a simple "Hello World" Quixote application can respond to 10 requests per second on our modest hardware (a Linux machine with two 733-MHz CPUs). Looking at our Web logs, we see that we average about one hit per minute over the course of the day, so we can still accommodate many more users than we're currently getting. If we ever do need to handle more traffic, I can see a fairly clear path for expansion by running multiple front-end Web servers and putting the ZEO server on a faster machine. Using a storage that's more efficient than the default FileStorage would also be a possibility for optimization. In short, performance has not been a problem yet, and if it ever becomes one, we have several different potential approaches to fixing it, and I'm confident that some or all of these approaches would pay off.

We're starting to support handheld computers, at least for use by fab site staff. Neil Schemenauer wrote a Palm application for entering metrology data, but that entails maintaining a PalmOS application written in C; yuck! As wireless networking becomes more mature and available inside fabs, we're going to move away from the Palm and toward handheld computers running Internet Explorer (or some other Web browser) on Windows CE. Because such handhelds can remain connected to the Internet at all times, they'll simply use our Web site.

We've already made the changes to support handhelds, as they only took about a day to make. The only change was to detect a client running on Windows CE and modify the HTML tables slightly, making them narrow enough to fit the handheld's small screen. Choosing the wireless Web is a win for everyone: the developers only have to maintain the Web site, not the Web site and a separate application; fabs get a more attractive user interface and more flexibility, as they can use any of the services we provide, not just the ones we manage to implement on the Palm; and the changes will also let customers use handhelds, should they want to.
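
A change of this kind might look roughly like the sketch below. It is not our actual code: it assumes that the browser identifies itself with the text "Windows CE" in its User-Agent header, and the function names, the table widths, and the sample User-Agent string are all illustrative.

```python
def is_handheld(user_agent):
    """Guess whether the client is a Windows CE handheld (assumed check)."""
    return "windows ce" in (user_agent or "").lower()


def table_width(user_agent):
    """Return an HTML width attribute suited to the client's screen."""
    if is_handheld(user_agent):
        return 'width="240"'      # hypothetical narrow layout
    return 'width="100%"'


# Illustrative User-Agent value only.
agent = "Mozilla/4.0 (compatible; MSIE 4.01; Windows CE; PPC; 240x320)"
print("<table %s>" % table_width(agent))
```

Keeping the test in one small function means that templates only need to ask for a width, and support for another kind of device could be added in a single place.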

12 Conclusion

The new system was put into production use on April 23rd, 2001, and was greeted with positive reactions from both the MEMS Exchange engineering team and the staff at our fabs. As the project enters the maintenance phase of development, I'm confident that the software development team can handle whatever future changes our engineers request. I have a few reasons for believing this:

To be more subjective, I think the MEMS Exchange is a good prototype for other Web-based applications. Using a high-level language such as Python makes writing code easy and pleasant, and the Python interpreter is reasonably fast and very stable. Quixote provides a simple yet flexible environment for developing and debugging Web applications. The ZODB makes it very easy to store persistent data without having to bridge the impedance mismatch between Python and a relational database. Tools such as Grouch and unit testing mean we can be reasonably confident that the code continues to run correctly. Software tools we build, such as Quixote and Grouch, can be released as free software because they're not core to our business, and because we'll benefit if other people use the software and help improve it.

12.1 Related Links

http://www.paulgraham.com/avg.html: "Beating the Averages", by Paul Graham, discusses how the Web gives programmers the freedom to use any language that lets you develop software more quickly. Graham argues for Lisp; I think Python is just as good for this purpose.
