Extensible Markup Language (XML) defines a set of rules for encoding documents in a format that is both human- and machine-readable.
It's been around since about 1996 (XML 1.0 became a W3C Recommendation in 1998), so all major platforms have good support
Has robust validation frameworks: Document Type Definition (DTD) and XML Schema Definition (XSD)
Subject to vulnerabilities such as the "billion laughs" entity-expansion attack (see examples/billion_lolz.py)
In the repository you'll find a random portion of the Wikipedia database in examples/data/enwiki-latest-pages-articles1.xml-p000000010p000010000-shortened.bz2
The original source is http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2, but it's pretty big, so we'll just use a subset of it.
We'll parse this using several of the tools presented.
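Since the sample data is bz2-compressed, it can be handed to a parser without unpacking it on disk first. The following is a minimal sketch of that idea, not the approach the example scripts necessarily take; count_elements and the tiny in-memory document are made up for illustration.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def count_elements(source):
    """Count elements in bz2-compressed XML; source is a path or a binary file object."""
    # bz2.open returns a file object that decompresses as it is read.
    with bz2.open(source, "rb") as stream:
        return sum(1 for _ in ET.parse(stream).iter())


# Stand-in for the real dump: a tiny document compressed in memory.
sample = bz2.compress(b"<mediawiki><page/><page/></mediawiki>")
element_count = count_elements(io.BytesIO(sample))
```

Passing the path of the shortened dump instead of the BytesIO object should work the same way.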
The xml.sax package provides a number of modules which implement the Simple API for XML (SAX) interface for Python
A SAX application has three kinds of objects: readers, handlers, and input sources
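To make the three kinds of objects concrete, here is a small sketch that wires a reader, a handler, and an input source together. ElementCounter and the inline document are invented for this illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import InputSource


class ElementCounter(ContentHandler):
    """Handler: receives callbacks from the reader as it walks the document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


# Input source: wraps the bytes to be parsed.
source = InputSource()
source.setByteStream(io.BytesIO(b"<root><page/><page/></root>"))

# Reader: the parser that drives the handler.
reader = xml.sax.make_parser()
handler = ElementCounter()
reader.setContentHandler(handler)
reader.parse(source)
```

After parsing, handler.count holds the number of opening tags the reader reported.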
xml.sax.handler defines 4 kinds of handlers: ContentHandler, DTDHandler, EntityResolver, and ErrorHandler
The most important handler for getting your job done is xml.sax.handler.ContentHandler
To create your handler, just subclass ContentHandler
3 methods are necessary to override in order to get element data: startElement, endElement, and characters
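A handler that overrides startElement, characters, and endElement could look roughly like the sketch below. It is not the code from the examples directory; TitleHandler and the two-page sample are assumptions made for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TitleHandler(ContentHandler):
    """Collect the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # May be called several times for one run of text, so accumulate.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


sample = (
    b"<mediawiki>"
    b"<page><title>Anarchism</title></page>"
    b"<page><title>Autism</title></page>"
    b"</mediawiki>"
)
handler = TitleHandler()
xml.sax.parseString(sample, handler)
```

Buffering in characters matters because the parser is free to deliver one text node in several chunks.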
examples/wikipedia-sax.py creates a handler which logs the content of every <title> tag it encounters
Try running this to make sure it works
Now it's your turn.
The xml.dom package provides a number of modules which implement the Document Object Model API
A reference implementation is in xml.dom.minidom - Minimal DOM implementation
DOM is a W3C standard which minidom implements. If you've done DOM manipulation in JavaScript, the methods will be familiar: createElement, getElementsByTagName, appendChild, ...
To get a reference to a Document object for your XML, just pass it to one of the parse methods: xml.dom.minidom.parse (filename or file object) or xml.dom.minidom.parseString (string)
The optional parser argument is for passing a SAX2 parser object
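As a quick illustration, the sketch below parses a small inline string with parseString and pulls out the title text; xml.dom.minidom.parse takes a filename or file object in the same way. The sample document is made up.

```python
from xml.dom.minidom import parseString

document = parseString("<pages><title>First</title><title>Second</title></pages>")

# getElementsByTagName searches the whole tree below the node it is called on.
titles = [
    node.firstChild.data
    for node in document.getElementsByTagName("title")
]
```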
You can also create a Document from scratch by calling a method on a "DOM Implementation" object:
from xml.dom.minidom import getDOMImplementation
impl = getDOMImplementation()
newdoc = impl.createDocument(None, "some_tag", None)
top_element = newdoc.documentElement
text = newdoc.createTextNode('Some textual content.')
top_element.appendChild(text)
Duplicate a node with node.cloneNode(deep) # True or False to specify a deep or shallow copy
Get a reference to a node's parent with node.parentNode
Append a node to another with node.appendChild(element)
Create a new node with document.createElement('elementname')
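The node operations above can be tried out together in a short sketch. The element names ("library", "book") are arbitrary choices for this example.

```python
from xml.dom.minidom import getDOMImplementation

document = getDOMImplementation().createDocument(None, "library", None)
root = document.documentElement

# Create a node and attach it to the tree.
book = document.createElement("book")
book.appendChild(document.createTextNode("Dune"))
root.appendChild(book)

# A deep copy includes the text child; a shallow copy does not.
deep_copy = book.cloneNode(True)
shallow_copy = book.cloneNode(False)
root.appendChild(deep_copy)

xml_text = root.toxml()
```

Note that a cloned node has no parent until it is appended somewhere.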
examples/wikipedia-minidom.py reads a Document, then logs the content of every <title> tag it encounters
Try running this to make sure it works
Now it's your turn.
xml.etree.ElementTree was integrated into the standard library in Python 2.5; it provides a more Pythonic API to the document and is not tied to the DOM standard
Provides limited XPath support
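xml.etree.cElementTree is a compatible C implementation (deprecated since Python 3.3, where ElementTree picks up the C accelerator automatically, and removed in Python 3.9)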
Parse a file with xml.etree.ElementTree.parse
Parse an XML string with xml.etree.ElementTree.fromstring
Write out an ElementTree with xml.etree.ElementTree.ElementTree.write(f) # filename or file object
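The calls listed above fit together as in the following sketch, which parses a string, uses the limited XPath support, and writes the tree back out. The sample document and the in-memory buffer are assumptions made to keep the example self-contained.

```python
import io
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<pages>"
    "<page id='1'><title>First</title></page>"
    "<page id='2'><title>Second</title></page>"
    "</pages>"
)

# Plain path: every <title> that is a child of a <page> child of the root.
titles = [title.text for title in root.findall("page/title")]

# Limited XPath: an attribute predicate selects a single page.
second = root.find("page[@id='2']/title").text

# Wrap the root in an ElementTree to write it out; a filename works as well.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer)
```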
XML namespaces are a W3C specification for uniquely identifying elements and attributes in an XML document
Multiple namespaces can be included in one document
Namespace names are URIs, strings chosen for their uniqueness property. Nothing is implied about the data behind the URI.
XML namespace definitions are added to an element, and apply to all child elements. Thus they are usually applied to the root element. For instance, in our sample data the root element is:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="en" >
The namespace of an element is indicated by adding prefix: to the element, e.g.
<html:p>explicit html p tag</html:p >
In an ElementTree, qualified names are stored fully qualified in "Clark's notation", which is a single string of the form {uri}local, e.g. {http://www.mediawiki.org/xml/export-0.8/}page
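To see Clark's notation in practice, the sketch below parses a cut-down document that declares the same default namespace as the sample data and inspects the resulting tag names. The shortened document is an assumption for illustration only.

```python
import xml.etree.ElementTree as ET

NS = "http://www.mediawiki.org/xml/export-0.8/"
root = ET.fromstring(
    '<mediawiki xmlns="%s"><page><title>Anarchism</title></page></mediawiki>' % NS
)

# The default namespace is folded into every tag name.
root_tag = root.tag

# A bare "page" would not match; the qualified name does.
unqualified = root.find("page")
page = root.find("{%s}page" % NS)
```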
When locating elements in a namespaced document, either pass the fully qualified name in the form above to find/findall/iterfind, or add the namespaces kwarg. e.g.
namespaces = {'mw': 'http://www.mediawiki.org/xml/export-0.8/'}  # the prefix name is arbitrary
for title in root.findall('mw:page/mw:title', namespaces=namespaces):
    print(title.text)
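Both lookup styles can be compared side by side. This is a sketch against a two-page stand-in for the sample data; the "mw" prefix is an arbitrary name chosen here, since the prefix only needs to match a key in the dictionary.

```python
import xml.etree.ElementTree as ET

MEDIAWIKI_NS = "http://www.mediawiki.org/xml/export-0.8/"
root = ET.fromstring(
    '<mediawiki xmlns="%s">'
    "<page><title>Anarchism</title></page>"
    "<page><title>Autism</title></page>"
    "</mediawiki>" % MEDIAWIKI_NS
)

# Option 1: spell out the fully qualified name.
qualified = [
    title.text
    for title in root.findall("{%s}page/{%s}title" % (MEDIAWIKI_NS, MEDIAWIKI_NS))
]

# Option 2: map a prefix of our choosing to the URI.
namespaces = {"mw": MEDIAWIKI_NS}
prefixed = [
    title.text
    for title in root.findall("mw:page/mw:title", namespaces=namespaces)
]
```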
examples/wikipedia-elementtree.py logs the content of every <title> tag it encounters
Try running this to make sure it works
Now it's your turn: do the same author counting as with SAX, but this time use ElementTree.
lxml is a Python binding for the C libraries libxml2 and libxslt
Does not ship with Python
Provides an interface similar to ElementTree
Provides a SAX-compliant API
Has a focus on performance