(or “Your broken XML broke my Tree!”)
Recently, I decided to parse an HTML file using ElementTree. Nothing fancy; most of the HTML seems to be well-formed anyway. But (there is always a but) the source file I’m trying to parse starts with , have no
Now, this isn’t purely a problem with ElementTree. If I save the file, run TextMate’s “Tidy HTML” on it, and run it through ElementTree again, everything works perfectly.
Not only do the elements themselves get weird, but ElementTree (at least the version I’m using) can’t use any special filters in its XPath search, like indexes (div[2]) or attribute searches ([@class="class"]).
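To make the “weird” part concrete, here is a minimal sketch of what ElementTree does with a namespaced document. The SAMPLE markup is made up for illustration (it is not my real file); the only thing it shares with it is the XHTML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for this example.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><div>first</div><div>second</div></body>'
    '</html>'
)

root = ET.fromstring(SAMPLE)

# The namespace URI becomes part of every tag name.
print(root.tag)  # {http://www.w3.org/1999/xhtml}html

# A plain tag name no longer matches anything...
plain = root.find('body')
print(plain)  # None

# ...but the fully qualified name does.
qualified = root.find('{http://www.w3.org/1999/xhtml}body')
print(qualified.tag)  # {http://www.w3.org/1999/xhtml}body
```

So every step of a search path has to carry the namespace, otherwise nothing is found.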
The solution I found was to convert the whole XPath (without any attribute search) to a longer form, adding “{http://www.w3.org/1999/xhtml}” to every element and doing the index search manually. It seems to work fine (it solves my problem).
def _find_xpath(root, path):
    """Finds an XPath element "path" in the element "root", converting it to
    the namespace-qualified form ElementTree expects."""
    elements = path.split('/')
    path = []
    for el in elements:
        if not el.endswith(']'):
            path.append('{http://www.w3.org/1999/xhtml}' + el)
        else:
            # collect what we have, find the element, reset root and path
            this_element = el.split('[')
            # first part, without the index
            path.append('{http://www.w3.org/1999/xhtml}' + this_element[0])
            xpath = '/'.join(path)
            root = root.findall(xpath)
            # XPath indexes start at 1, Python lists start at 0
            pos = int(this_element[1][0:-1]) - 1
            root = root[pos]
            path = []
    if len(path) > 0:
        xpath = '/'.join(path)
        root = root.find(xpath)
    return root
I reckon it’s not the cleanest solution and that I should probably use recursion somehow, but it works.
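For what it’s worth, newer ElementTree releases (the one in Python 3’s standard library, for instance) do understand simple predicates such as [2] and [@class="..."]. Assuming one of those versions, a shorter variant only needs to qualify each step and can leave the index handling to the library. This is a rough sketch, not a drop-in replacement: find_xhtml and the sample document are names I made up for the example, and splitting on '/' breaks if an attribute value itself contains a slash.

```python
import xml.etree.ElementTree as ET

XHTML = 'http://www.w3.org/1999/xhtml'


def find_xhtml(root, path):
    """Qualify every step of a simple path with the XHTML namespace and let
    ElementTree evaluate it, predicates included (hypothetical helper)."""
    steps = []
    for step in path.split('/'):
        if step in ('', '.', '..', '*'):
            # leave separators and special steps untouched
            steps.append(step)
        else:
            steps.append('{%s}%s' % (XHTML, step))
    return root.find('/'.join(steps))


# Hypothetical document, invented for this example.
doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="a">first</div><div class="b">second</div>'
    '</body></html>'
)

print(find_xhtml(doc, 'body/div[2]').text)            # second
print(find_xhtml(doc, 'body/div[@class="a"]').text)   # first
```

On older versions that reject predicates, the manual index handling above is still needed.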
Improvement suggestions are welcome.

