Workaround for Broken ElementTrees

(or “Your broken XML broke my Tree!”)

Recently, I decided to parse an HTML file using ElementTree. Nothing fancy; most of the HTML seems well-formed anyway. But (there is always a but) the source file I’m trying to parse declares the XHTML namespace and, for some reason, this makes ElementTree a very confused parser. And, by confused, I mean that all the elements, instead of keeping their original tags (e.g., table), get the prefix “{http://www.w3.org/1999/xhtml}” added to them (e.g., {http://www.w3.org/1999/xhtml}table).
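
To make the confusion concrete, here’s a minimal sketch of the behaviour (the snippet below is made up, not the actual file I was parsing):

import xml.etree.ElementTree as ET

snippet = '<html xmlns="http://www.w3.org/1999/xhtml"><body><table/></body></html>'
root = ET.fromstring(snippet)

print(root.tag)                 # {http://www.w3.org/1999/xhtml}html
print(root.find('body/table'))  # None: the plain path no longer matches
print(root.find('{http://www.w3.org/1999/xhtml}body/'
                '{http://www.w3.org/1999/xhtml}table'))  # this one matches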

Now, this isn’t purely a problem with ElementTree. If I save the file, run TextMate’s “Tidy HTML” on it, and run it through ElementTree again, everything works perfectly.

Not only do the elements themselves get weird, but ElementTree also can’t use any special filters in its XPath search, like indexes (div[2]) or attribute searches ([@class="class"]).

The solution I found was to convert the whole XPath (without any attribute search) to a longer form, which seems to work fine (it solves my problem): adding the “{http://www.w3.org/1999/xhtml}” prefix to every element and doing the index search manually.

def _find_xpath(root, path):
    """Finds an XPath element "path" in the element "root", converting to the
    weird information ElementTree."""
    elements = path.split('/')
    path = []
    for el in elements:
        if not el.endswith(']'):
            path.append('{http://www.w3.org/1999/xhtml}'+el)
        else:
            # collect what we have, find the element, reset root and path
            this_element = el.split('[')
            # first part, without the index filter
            path.append('{http://www.w3.org/1999/xhtml}'+this_element[0])
            xpath = '/'.join(path)
            root = root.findall(xpath)

            pos = int(this_element[1][0:-1]) -1
            root = root[pos]

            path = []

    if len(path) > 0:
        xpath = '/'.join(path)
        root = root.find(xpath)

    return root
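
For reference, here’s how I’d call it (the file name “page.xhtml” and the path below are just an illustration, not my actual code):

import xml.etree.ElementTree as ET

tree = ET.parse('page.xhtml')
# Equivalent to the XPath body/div[2]/table: the namespace prefix is added
# for us and the positional filter is applied by hand.
table = _find_xpath(tree.getroot(), 'body/div[2]/table')
print(table.tag)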

I reckon it’s not the cleanest solution and that I should probably use recursion somehow, but it works.

Improvement suggestions are welcome.

Web 2.0 is not streamable

This week our connection at home is being shaped. This means that, instead of the shiny 1 Mbps we usually have, we now have to suffer through pages with a bandwidth of just 64 Kbps. But there is one thing that such limited bandwidth made me realize: the next web isn’t streamable.

To get to that conclusion, I didn’t have to go far: just opening Google Reader showed that it’s impossible to live with a very limited bandwidth. Right now, I should have something like 1,000 unread items across a hundred subscriptions, which means Reader has to download a large description file with all that information. Thing is, right now, it doesn’t do anything: it shows the default Google application header, the logo, and that’s it. But, knowing how things usually work in this Web 2.0 universe, I know that there is something going on:

Interactive sites like Google Reader and GMail use AJAX. AJAX relies on XML, which is structured plain-text data (the same can be said of JSON). XML allows the data to appear in any order inside its structure. As an example, imagine a book information list: inside the “Book” item, you can have a “Title”, which can be at the very beginning or the very end, and the result would be the same. So any application that uses XML needs to first receive the information, then convert it to some internal representation, and only then can the data be used. Google Reader wasn’t “doing nothing”: it was receiving the list of feeds and the initial 100-something feed items which, due to the small bandwidth, was taking a very long time. And, because it needed the whole thing, nothing was being displayed.
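
To put the book example in code (invented data, obviously): both snippets below carry exactly the same record, so a consumer can’t count on “Title” arriving first and has to buffer the whole element before using it.

import xml.etree.ElementTree as ET

first = '<Book><Title>Example</Title><Author>Someone</Author></Book>'
second = '<Book><Author>Someone</Author><Title>Example</Title></Book>'

for snippet in (first, second):
    book = ET.fromstring(snippet)
    print(book.findtext('Title'), book.findtext('Author'))
# Both print "Example Someone": the order inside <Book> doesn't matter,
# which is exactly why the parser needs the whole element first.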

Which is a problem I see with many XML/JSON results: you can’t stream them in a way that lets you start using the information before having it all. For example, in Mitter, we can’t display tweets before we’ve received the whole message. If XML and JSON weren’t so loosely defined and we had a way to assure that after the element “User” we would have an element “Message”, then we could start displaying tweets before we had all of them (not that the format changes all the time, but since we can’t ensure that ordering, we must be ready for the data appearing in a different order, or with some other data between the ones we need).
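
Just to sketch what that kind of streaming could look like if the ordering were guaranteed: ElementTree’s iterparse hands you each element as soon as its closing tag arrives, so every complete tweet could be displayed right away. The element names below are assumptions for the example, not the real format we receive in Mitter.

import io
import xml.etree.ElementTree as ET

feed = io.BytesIO(b"""<statuses>
  <status><user>alice</user><text>first tweet</text></status>
  <status><user>bob</user><text>second tweet</text></status>
</statuses>""")

for event, element in ET.iterparse(feed, events=('end',)):
    if element.tag == 'status':
        print(element.findtext('user'), '-', element.findtext('text'))
        element.clear()  # already displayed, free it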

In a way, that’s a complete reversal of roles for AJAX. In the very beginning, AJAX was used to prevent large downloads: if you had a page where it would be useful to display all the options to help the user find data, you’d have to fill the page with that data (imagine, for example, a page with all your Del.icio.us tags, plus all the possible suggestions from all the other users). The use of AJAX meant the site could filter results, so you’d have a smaller page, which would make small requests to the webserver and get back small amounts of data. Overall, it meant the user experience would be faster. Now we have so much information packed into XML/JSON formats that the user experience is not as responsive as it should be.