| Author: | A.M. Kuchling <amk@amk.ca> |
|---|---|
| Version: | 1820 |
| Date: | 2006-02-07 |
<document>
<?xml-stylesheet type="text/css"
href="basic.css"?>
<!-- Generated with ElementTree -->
<author name='amk' href="http://www.amk.ca" />
<p class="note">Note.</p>
<p class="warning">Warning paragraph.</p>
<p>Regular paragraph.</p>
</document>
<?xml version="1.0"?>
<document>
<h1>Heading</h1>
<p>Paragraph. <em>Word</em></p>
</document>
Prints all paragraphs that have 'class=warning' attribute:
from elementtree import ElementTree as et
tree = et.parse('ex-1.xml')
for para in tree.getiterator('p'):
    cl = para.get('class')
    if cl == 'warning':
        print para.text
et.parse(source): returns a tree
tree = et.parse('ex-1.xml')
tree = et.parse(open('ex-1.xml', 'r'))
import urllib
feed = urllib.urlopen(
    'http://planet.python.org/rss10.xml')
tree = et.parse(feed)
et.XML("<?xml ..."): returns root element
svg = et.XML("""<svg width="10px" version="1.0">
</svg>""")
svg.set('height', '320px')
svg.append(elem1)
...
et.XMLID(str) : (Element, dict)
xml_doc = """<document>
<h1 id="chapter1">...</h1>
<p id="note1" class="note">...</p>
<p id="warn1" class="warning">...</p>
<p>Regular paragraph.</p>
</document>"""
root, id_d = et.XMLID(xml_doc)
print id_d
{'note1': <Element p at 3df3a0>,
 'warn1': <Element p at 3df468>,
 'chapter1': <Element h1 at 3df3f0>}
For XMLID():
- The dictionary maps element IDs to elements.
- It looks for attributes named 'id'.
- xml:id is not yet supported
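The ID lookup can be tried without any files on disk. This is a minimal sketch, assuming the standard-library xml.etree.ElementTree module (Python 3) rather than the standalone elementtree package used on these slides; the element text is filled in only for illustration.

```python
import xml.etree.ElementTree as ET

xml_doc = """<document>
  <h1 id="chapter1">Heading</h1>
  <p id="note1" class="note">Note.</p>
  <p id="warn1" class="warning">Warning paragraph.</p>
  <p>Regular paragraph.</p>
</document>"""

# XMLID() returns the root element plus a dict keyed by 'id' attribute values.
root, id_d = ET.XMLID(xml_doc)

# Jump straight to an element without walking the tree.
warning = id_d['warn1']
print(warning.tag, warning.get('class'), warning.text)

# Elements without an 'id' attribute are simply absent from the dictionary.
print(sorted(id_d))
```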
et.ElementTree([root], [file]) -- Creates a new ElementTree
root = et.XML('<svg/>')
new_tree = et.ElementTree(root)
tree = et.ElementTree(file='ex-1.xml')
tree.write(file, encoding) -- outputs XML to file
# Encoding is US-ASCII
tree.write('output.xml')
f = open('output.xml', 'w')
tree.write(f, 'utf-8')
file can be a filename string or a file-like object that has a write() method.
The default encoding is us-ascii, which isn't very useful. You'll usually want to specify UTF-8.
Namespace declarations are generated on output. Prefixes aren't preserved, so instead of 'dc:creator' you'll get something like 'ns0:creator'.
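To see both the encoding and the generated prefixes, here is a small sketch. It assumes the standard-library xml.etree.ElementTree (Python 3); the namespace URI and the element names are made up for the example, and an in-memory buffer stands in for a real file.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, chosen so that no well-known prefix applies.
NS = 'http://example.org/meta'

root = ET.Element('quotation')
creator = ET.SubElement(root, '{%s}creator' % NS)
creator.text = u'Caf\u00e9 author'

tree = ET.ElementTree(root)

# write() accepts any object with a write() method; BytesIO stands in for a file.
buf = io.BytesIO()
tree.write(buf, encoding='utf-8')
data = buf.getvalue()

# The serializer invents a prefix (ns0) for the namespace.
print(data.decode('utf-8'))
```

The non-ASCII character is written as UTF-8 bytes because the encoding was given explicitly.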
tree.getroot() : returns root element of a tree.
root = tree.getroot()
for c in root.getchildren():
    ...
tree|elem.getiterator([tag]) -> iterator over elements
# Print all elements
for elem in tree.getiterator():
    ...
for elem in tree.getiterator('*'):
    ...
# Print all paragraph elements
for elem in tree.getiterator('p'):
    ...
Traversal is pre-order:
document, block, p, p, block, p, block, p
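The order above can be reproduced with a small sketch. It assumes the standard-library xml.etree.ElementTree (Python 3), where iter() is the current name for getiterator(); the document structure is a guess that is merely consistent with the listed order.

```python
import xml.etree.ElementTree as ET

# Assumed structure: three blocks holding two, one, and one paragraphs.
root = ET.fromstring(
    '<document>'
    '<block><p/><p/></block>'
    '<block><p/></block>'
    '<block><p/></block>'
    '</document>')

# iter() yields the element itself first, then each subtree in document order.
tags = [elem.tag for elem in root.iter()]
print(', '.join(tags))

# Passing a tag name restricts the walk to matching elements.
paragraphs = list(root.iter('p'))
print(len(paragraphs))
```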
elem.tag : the element's name
Namespaces are treated as "{namespace-uri}tag":
<h:html xmlns:xdc="http://www.xml.com/books"
        xmlns:h="http://www.w3.org/HTML/1998/html4">
<h:body>
<xdc:bookreview> ...
| Name in the document | elem.tag |
|---|---|
| h:html | {http://www.w3.org/HTML/1998/html4}html |
| h:body | {http://www.w3.org/HTML/1998/html4}body |
| xdc:bookreview | {http://www.xml.com/books}bookreview |
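A quick way to check the expanded names is to parse the fragment and print each tag. This sketch assumes the standard-library xml.etree.ElementTree (Python 3); the elided elements are closed off so the string is well-formed, which is my addition.

```python
import xml.etree.ElementTree as ET

doc = """<h:html xmlns:xdc="http://www.xml.com/books"
        xmlns:h="http://www.w3.org/HTML/1998/html4">
  <h:body>
    <xdc:bookreview/>
  </h:body>
</h:html>"""

root = ET.fromstring(doc)

# Prefixes are gone after parsing; every tag carries its namespace URI.
expanded = [elem.tag for elem in root.iter()]
for tag in expanded:
    print(tag)

# Searching must use the expanded name, not the prefix from the source file.
HTML_NS = 'http://www.w3.org/HTML/1998/html4'
body = root.find('{%s}body' % HTML_NS)
```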
Children are accessed by indexing and slicing.
elem[n] returns the n'th child
elem[m:n] returns a list of children m through n-1
len(elem) returns the number of children
elem.getchildren() -- returns list of children
Adding children:
elem[m:n] = [e1, e2]
elem.append(elem2) -- append as last child
elem.insert(index, elem2) -- insert at given index
Removing children:
del elem[n] -- delete n'th child
elem.remove(elem2) -- remove elem2 if it's a child
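The list-like operations fit in a few lines. A minimal sketch, assuming the standard-library xml.etree.ElementTree (Python 3); the tag names are placeholders.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<list><a/><b/><c/><d/></list>')

first = root[0]        # single child by index
middle = root[1:3]     # children 1 and 2; the end index is excluded
count = len(root)      # number of direct children

root.append(ET.Element('e'))      # becomes the last child
root.insert(0, ET.Element('z'))   # becomes the first child

del root[1]                       # delete by position (removes 'a')
root.remove(root[-1])             # remove a specific child (removes 'e')

print([child.tag for child in root])
```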
elem.makeelement(tag, attr_dict)
et.Element(tag, attr_dict, **extra)
et.SubElement(parent, tag, attr_dict, **extra)
feed = root.makeelement('feed',
{'version':'0.3'})
svg = et.Element('svg', {'version':'1.0'},
width='100px', height='50px')
defs = et.SubElement(svg, 'defs', {})
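The three constructors differ mainly in whether the new element gets attached. A sketch assuming the standard-library xml.etree.ElementTree (Python 3); the 'g' element and its id are hypothetical.

```python
import xml.etree.ElementTree as ET

# Attributes can come from a dict, from keyword arguments, or both.
svg = ET.Element('svg', {'version': '1.0'}, width='100px', height='50px')

# SubElement() creates the child and appends it to the parent in one step.
defs = ET.SubElement(svg, 'defs')

# makeelement() only builds the element; attaching it is a separate step.
group = svg.makeelement('g', {'id': 'layer1'})
svg.append(group)

print(ET.tostring(svg, encoding='unicode'))
```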
Atom 0.3 input looks like:
<feed version="0.3" xmlns="http://purl.org/atom/ns#"> ...
<entry> ...
<content type="text/html" mode="escaped">
<p><a href="http://example.org">This
picture</a> ... </p> </content>
</entry>
</feed>
We want HTML output like this:
<div>
<p><a href="http://example.org">This picture</a>... </p>
<hr />
<p><a href="http://example.org/2">Photo 2</a>... </p>
<hr />
</div>
import sys
ATOM_NS = 'http://purl.org/atom/ns#'
tree = et.parse('atom-0.3.xml')
div = et.Element('div')
html = et.ElementTree(div)
for entry in tree.getiterator('{%s}entry'
                              % ATOM_NS):
    for content in entry.getiterator('{%s}content'
                                     % ATOM_NS):
        # Check for right content element here
        typ = content.get('type')
        mode = content.get('mode')
        if typ == 'text/html' and mode == 'escaped':
            # Parse the escaped HTML inside a temporary wrapper element
            subtree = et.XML('<root>' +
                             content.text.strip()
                             + '</root>')
            for c in subtree.getchildren():
                div.append(c)
            div.append(et.Element('hr'))
html.write(sys.stdout)
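The same conversion can be run without a feed file. This is a sketch, assuming the standard-library xml.etree.ElementTree (Python 3) and a made-up one-entry feed; I also assume that "escaped" mode means the HTML arrives entity-escaped inside the content element.

```python
import xml.etree.ElementTree as ET

ATOM_NS = 'http://purl.org/atom/ns#'

# Made-up feed with one escaped HTML entry.
feed_xml = """<feed version="0.3" xmlns="http://purl.org/atom/ns#">
  <entry>
    <content type="text/html" mode="escaped">
      &lt;p&gt;&lt;a href="http://example.org"&gt;This picture&lt;/a&gt; ... &lt;/p&gt;
    </content>
  </entry>
</feed>"""

feed = ET.fromstring(feed_xml)
div = ET.Element('div')

for entry in feed.iter('{%s}entry' % ATOM_NS):
    for content in entry.iter('{%s}content' % ATOM_NS):
        if (content.get('type') == 'text/html'
                and content.get('mode') == 'escaped'):
            # The escaped markup arrives as plain text; wrap it in a
            # throwaway root so it parses as one well-formed document.
            subtree = ET.fromstring(
                '<root>' + content.text.strip() + '</root>')
            for child in list(subtree):
                div.append(child)
            div.append(ET.Element('hr'))

print(ET.tostring(div, encoding='unicode'))
```

This only works when the embedded HTML happens to be well-formed XML; real feeds often need an HTML parser instead.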
elem.attrib : dictionary mapping names to values
elem.get(name, default=None) : get attribute value
elem.set(name, value): set a new value
elem.keys(): list of attribute names
elem.items(): list of (name, value) tuples
del elem.attrib[name]: delete an attribute
You can also access the .attrib dictionary directly.
Convert Atom 0.3 'content' elements to 1.0 form:
ATOM_CONTENT = '{%s}content' % ATOM_NS
for content in tree.getiterator(ATOM_CONTENT):
    typ = content.get('type')
    mode = content.get('mode')
    if typ == 'text/html' and mode == 'escaped':
        content.set('type', 'html')
        del content.attrib['mode']
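Here is the attribute rewrite as a self-contained sketch, assuming the standard-library xml.etree.ElementTree (Python 3) and a made-up two-entry feed. The second entry is there to show that non-matching elements are left alone.

```python
import xml.etree.ElementTree as ET

ATOM_NS = 'http://purl.org/atom/ns#'
ATOM_CONTENT = '{%s}content' % ATOM_NS

# Made-up feed: one escaped HTML entry, one plain-text entry.
feed = ET.fromstring(
    '<feed version="0.3" xmlns="http://purl.org/atom/ns#">'
    '<entry><content type="text/html" mode="escaped">'
    '&lt;p&gt;Hi&lt;/p&gt;</content></entry>'
    '<entry><content type="text/plain">Hi</content></entry>'
    '</feed>')

for content in feed.iter(ATOM_CONTENT):
    if content.get('type') == 'text/html' and content.get('mode') == 'escaped':
        content.set('type', 'html')   # overwrite an existing attribute
        del content.attrib['mode']    # drop one via the attrib dictionary

contents = list(feed.iter(ATOM_CONTENT))
for content in contents:
    print(sorted(content.items()))
```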
Elements have two attributes for text:
.text -- text between the element's start tag and its first child (or its end tag, if it has no children)
.tail -- text between the element's end tag and the next tag
<document><elem1>e1 content</elem1> Inter-element text <elem2>e2 content</elem2> </document>
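Parsing that exact fragment shows where each piece of text ends up. A sketch assuming the standard-library xml.etree.ElementTree (Python 3).

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<document><elem1>e1 content</elem1> Inter-element text '
    '<elem2>e2 content</elem2> </document>')

elem1, elem2 = root[0], root[1]

print(repr(root.text))    # text before the first child of document
print(repr(elem1.text))   # text inside elem1
print(repr(elem1.tail))   # text between </elem1> and <elem2>
print(repr(elem2.tail))   # text between </elem2> and </document>
```

Note that the inter-element text belongs to elem1, not to document, so removing elem1 from the tree takes that text with it.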
Checking if an element is a comment or PI:
if elem.tag is et.Comment:
    ...
elif elem.tag is et.ProcessingInstruction:
    ...
et.Comment(text) -- create a comment
et.ProcessingInstruction(target, text=None) -- create a PI
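The two factories and the identity check can be tried together. This sketch assumes the standard-library xml.etree.ElementTree (Python 3); the tree is built by hand because the default parser there discards comments and PIs.

```python
import xml.etree.ElementTree as ET

root = ET.Element('document')
root.append(ET.Comment('Generated with ElementTree'))
root.append(ET.ProcessingInstruction(
    'xml-stylesheet', 'type="text/css" href="basic.css"'))
ET.SubElement(root, 'p').text = 'Regular paragraph.'

# Comments and PIs are special elements whose tag is the factory function.
kinds = []
for child in root:
    if child.tag is ET.Comment:
        kinds.append('comment')
    elif child.tag is ET.ProcessingInstruction:
        kinds.append('pi')
    else:
        kinds.append(child.tag)
print(kinds)
```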
- ElementPath
- Parsing HTML
- Event-based parsing
ElementTree supports a small query language:
Simpler version of entry/content loop:
for content in tree.findall('entry/content'):
    ...
Methods:
findall(query): list of matching nodes
find(query): first matching element, or None
findtext(query, default=None): .text attribute of first matching element
Query = components separated by '/'
| Component | Meaning |
|---|---|
| . | Current element node |
| * | Matches any child element |
| <empty string> | Match any descendant |
| <name> | Matches elements with that name |
| Query | Result |
|---|---|
| p | All p children of the current element |
| .//p | All p descendants of the current element |
| chapter/p | All p that are children of a chapter |
| chapter/*/p | All p that are grandchildren of a chapter |
| chapter//p | All p that are descendants of a chapter |
| quotation/{http://purl.org/dc/elements/}creator | All dc:creator children of quotation elements |
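The queries in the table can be compared side by side on a tiny document. A sketch assuming the standard-library xml.etree.ElementTree (Python 3); the book structure and the 'appendix' name are invented. A plain name such as p matches direct children only, while .//p searches the whole subtree.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    '<book>'
    '<p>intro</p>'
    '<chapter><p>one</p><section><p>two</p></section></chapter>'
    '</book>')

def texts(query):
    """Return the .text of every element matched by the query."""
    return [elem.text for elem in book.findall(query)]

children = texts('p')             # direct children of book
anywhere = texts('.//p')          # any depth below book
direct = texts('chapter/p')       # children of a chapter
grand = texts('chapter/*/p')      # grandchildren of a chapter
nested = texts('chapter//p')      # descendants of a chapter

first = book.findtext('chapter/p')               # text of the first match
missing = book.find('appendix')                  # None when nothing matches
fallback = book.findtext('appendix/p', default='n/a')

print(children, anywhere, direct, grand, nested)
```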
This syntax is inspired by XPath, but it's a tiny, tiny subset of XPath. Missing features include:
- Absolute queries not allowed
- Can only select element nodes; there's no text() to select child text nodes
- No ability to select a numbered child (chapter[5] to select the fifth chapter element)
HTMLTreeBuilder, TidyHTMLTreeBuilder (requires elementtidy)
import urllib
from elementtree import TidyHTMLTreeBuilder
page = urllib.urlopen('http://www.python.org')
tree = TidyHTMLTreeBuilder.parse(page)
et.iterparse returns a stream of events.
parser = et.iterparse('largefile.xml',
                      ['start', 'end', 'start-ns', 'end-ns'])
for event, elem in parser:
    if event == 'end':
        ...
        # Discard element's contents
        elem.clear()
On my Mac, simply parsing the 1.5 MB file into a tree took up about 7 MB. With iterparse and clear(), the peak usage was about 2 MB. (I used the Book of Mormon because it was the largest XML document I could find.)
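The pattern is easy to try on generated data. This sketch assumes the standard-library xml.etree.ElementTree (Python 3); the 'verse' elements and the in-memory buffer are stand-ins for a real large file, and it makes no claim about memory figures.

```python
import io
import xml.etree.ElementTree as ET

# Synthetic input standing in for a large file.
data = (
    '<book>'
    + ''.join('<verse n="%d">text</verse>' % i for i in range(1000))
    + '</book>'
).encode('utf-8')

count = 0
for event, elem in ET.iterparse(io.BytesIO(data), events=('start', 'end')):
    # An element is only complete when its 'end' event arrives.
    if event == 'end' and elem.tag == 'verse':
        count += 1
        # Discard the element's text, attributes, and children once handled.
        elem.clear()

print(count)
```

On a 'start' event the element's text and children may not have been read yet, which is why the work happens on 'end'.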
ElementTree: http://effbot.org/zone/element-index.htm
Slides: http://www.amk.ca/talks/2006-02-07
Questions?
Features:
- xinclude:include
- Parse types of 'text' or 'xml'
- What's not supported: xpointer, the 'encoding' or 'accept' attributes.
<root xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="ex-1.xml" parse="xml"/>
</root>
Code:
from elementtree import ElementInclude
ElementInclude.include(tree.getroot())
Result:
<root>
<document>
<h1>Heading</h1> ...
</document>
</root>
The include() function recursively scans through the entire subtree of the element. You can supply your own loader function that receives the 'href' attribute
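A custom loader makes the include step testable without touching the filesystem. This is a sketch, assuming the standard-library xml.etree.ElementInclude (Python 3), where the loader is called with the href and the parse type and returns an element for 'xml' or a string for 'text'. The DOCS dictionary and the memory_loader name are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory "filesystem" keyed by href.
DOCS = {'ex-1.xml': '<document><h1>Heading</h1></document>'}

def memory_loader(href, parse, encoding=None):
    """Resolve an include from DOCS instead of opening a file."""
    if parse == 'xml':
        return ET.fromstring(DOCS[href])
    return DOCS[href]

root = ET.fromstring(
    '<root xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="ex-1.xml" parse="xml"/>'
    '</root>')

# include() modifies the tree in place, replacing each xi:include element.
ElementInclude.include(root, loader=memory_loader)

print(ET.tostring(root, encoding='unicode'))
```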