Workaround for Broken ElementTrees

(or “Your broken XML broke my Tree!”)

Recently, I decided to parse an HTML file using ElementTree. Nothing fancy, seems most of the HTML is well-formed anyway. But (there is always a but) the source file I’m trying to parse starts with , have no and, for some reason, this makes ElementTree a very confused parser. And, by confused, all the elements, instead of keeping their original tags (e.g., table) have the prefix “{http://www.w3.org/1999/xhtml}” added to them (e.g., {http://www.w3.org/1999/xhtml}table).

Now, this isn’t purely a problem with ElementTree. If I save the file and use TextMate “tidy HTML” and run through ElementTree again, everything works perfectly.

Not only the elements themselves get weird, but ElementTree can’t use any special filters in its XPath search, like indexes (div[2]), or any attribute search ([@class="class"]).

The solution I found was convert the whole XPath (without any attribute search) to a longer form, which seems to work fine (it solves my problem), adding the “{http://www.w3.org/1999/xhtml}” to every element and doing the index search manually.

def _find_xpath(root, path):
    """Finds an XPath element "path" in the element "root", converting to the
    weird information ElementTree."""
    elements = path.split('/')
    path = []
    for el in elements:
        if not el.endswith(']'):
            path.append('{http://www.w3.org/1999/xhtml}'+el)
        else:
            # collect what we have, find the element, reset root and path
            this_element = el.split('[')
            # first part, without the 
            path.append('{http://www.w3.org/1999/xhtml}'+this_element[0])
            xpath = '/'.join(path)
            root = root.findall(xpath)

            pos = int(this_element[1][0:-1]) -1
            root = root[pos]

            path = []

    if len(path) > 0:
        xpath = '/'.join(path)
        root = root.find(xpath)

    return root

I reckon is not the cleanest solution and that I should probably use recursion somehow, but it works.

Improvement suggestions are welcomed.