Parsing Arbitrary XML Namespaces in Ruby with Hpricot

(This post was inspired by a Ruby.MN conversation.)

I’ve burned through at least three different XML parsing libraries in building Cullect. I started with the built-in REXML (pro: built-in, con: heavy and slow), then moved to FeedTools (pro: easily parses the most common feed formats, con: no longer in active development). I have a lot of love for FeedTools, but in the end, it didn’t want to parse XML it didn’t already know about. Then, after a series of far less memorable libraries, I found _why’s Hpricot (pro: written by _why, con: written by _why).

Hpricot is fast and lean, and it doesn’t care about expected tags, namespaces, or valid XML. It just makes it easy to get the data and attributes out of the XML.

Here’s an example of how I’m grabbing a feed item’s permalink using Hpricot:

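require 'hpricot'

# feed_contents holds the raw feed XML as a String
doc = Hpricot.XML(feed_contents)
items = (doc/:item)
items.each do |raw_item|
  # Prefer the Pheedo or FeedBurner original link, then fall back to the plain link tag
  link = raw_item.%('pheedo:origLink') || raw_item.%('feedburner:origLink') || raw_item.%('link')
end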

Notice how Hpricot doesn’t require anything special to grab pheedo-namespace or feedburner-namespace links compared with standard links. Just tell it which tag you’re looking for. Fast, easy, scalable.
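
If the list of candidate tags grows, the fallback chain can be pulled into a small helper. This is only a sketch: permalink_for and PERMALINK_TAGS are names made up for illustration, not part of Hpricot or Cullect, and the helper assumes the item responds to at (the method that % aliases in Hpricot) and that matched nodes respond to inner_text. Unlike the loop above, it returns the link text rather than the element.

```ruby
# Candidate tags, most specific first. The first two are the namespaced
# tags from the example above; 'link' is the plain fallback.
PERMALINK_TAGS = ['pheedo:origLink', 'feedburner:origLink', 'link'].freeze

# Hypothetical helper (not an Hpricot API): returns the stripped text of
# the first candidate tag that is present and non-blank, or nil if none is.
# `item` is assumed to respond to `at(selector)`, as Hpricot elements do.
def permalink_for(item, tags = PERMALINK_TAGS)
  tags.each do |tag|
    node = item.at(tag)
    next if node.nil?

    text = node.inner_text.to_s.strip
    return text unless text.empty?
  end
  nil
end
```

Inside the items.each loop this would be called as permalink_for(raw_item). Skipping blank matches is a design choice of this sketch, so an empty origLink tag falls through to the next candidate.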

11 comments

  1. Is there a way to print out something like the verbatim XML from “raw_item”?

    And, on a totally different note, does Cullect support Atom in addition to RSS, or just RSS?

  2. @Robert,

    1. Is this what you’re looking for?:

    test = "<a><b><c>hello</c></b></a>"
    doc = Hpricot.XML(test)
    t = (doc/:a)
    t.inner_html
    => "<b><c>hello</c></b>"

    2. Yes. In addition to RSS, Cullect both parses and generates Atom feeds (it also generates YML, M3U, PLS, and JSON feeds).

  3. Garrick:

    Two things.

    One, Cullect’s handling of feed entry ids is broken. Incorrect behavior: Entry stored in database uniquely identified by its id element. Correct behavior: Entries stored in database as children of the parent feed, with read status uniquely identified by the entry’s id element.

    The reason what you’ve done is incorrect behavior is simple: I can hijack anyone else’s content trivially. I just have to write some other content with the same entry id, and get it plugged into the database before you poll the legitimate entry.

    And second, because I ran afoul of the entry id issue, I was unable to complete my run of the Atom XML namespace conformance tests. Once you fix the first bug, I guess you’re welcome to run them yourself.

    http://www.intertwingly.net/wiki/pie/XmlNamespaceConformanceTests

  4. Garrick:

    I plugged my own blog into Cullect. Several of the entries had been usurped by a Splog that had republished my content, but had used the same entry ids as the original posts. The splog’s content entered the database before the legitimate entries, so the legitimate content was ignored.

    If I wanted to be a total dick, I could write a script that would hammer your feed, and when a new entry was detected, it’d parse out the entry id, generate a new feed containing the same entry id, and say, a goatse image for content, and then immediately subscribe to it in Cullect. Because the initial parse will happen before the next polling of your feed, you never get to see your blog again. Just goatse images, forever and ever. Also, your bandwidth charges would be insane.

    All in all, it’s a really, really bad idea. Not to mention the fact that duplicate ids? Yeah, they’re super-common. Sometimes you’ll even get duplicate ids within the same feed. The correct primary key is simply a bog-standard auto-incrementing integer, and then just make sure the guid field has an index on it.

  5. Oh, and the rest of my content that wasn’t usurped by the splog? That’s showing up as belonging to several different Planet sites. All of whom correctly republished my content with the same id. But it’s still MY content, not theirs.

  6. Bob – thanks. I’m sorry posts in Cullect are being assigned to someone other than yourself. It’s a known bug that will be remedied in an upcoming release.

    Update – a couple hours later:
    Bob, the entries should now be assigned to their rightful authors. Thanks again – for FeedTools, for stopping by and leaving comments, and for encouraging me to resolve this bug sooner rather than later.

  7. I’m having some problems parsing those namespace tags with Hpricot. When I try to get the contents (e.g. date = entry.at('gd:when').attributes['startTime']) I get nil returned.

    Anyone know why this is?

  8. Hpricot does care about namespaces… for example

    str = %(

    barfoo

    foobar
    )

    doc = Hpricot(str)

    doc.search("entry > gd:when")

    no results…

    but change gd:when to when

    and

    results!
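
The entry id keying suggested in comments 3 and 4 (an auto-incrementing integer as the primary key, entries stored as children of their feed, and a plain index on the guid) can be sketched in memory as follows. Everything here is hypothetical: EntryStore, feed_id, and guid are illustrative names, not Cullect’s schema, and treating a repeated guid within one feed as the same entry is an assumption of this sketch.

```ruby
# Hypothetical in-memory stand-in for an entries table; not Cullect's schema.
class EntryStore
  Entry = Struct.new(:id, :feed_id, :guid, :content)

  def initialize
    @next_id = 1                               # auto-incrementing primary key
    @entries = {}                              # id => Entry
    @by_guid = Hash.new { |h, k| h[k] = [] }   # non-unique index on guid
  end

  # Stores an entry under its parent feed. The same guid arriving from a
  # different feed gets its own row, so it cannot displace the original.
  # Assumption: a guid repeated within one feed is treated as the same entry.
  def add(feed_id, guid, content)
    existing = find(feed_id, guid)
    return existing if existing

    entry = Entry.new(@next_id, feed_id, guid, content)
    @next_id += 1
    @entries[entry.id] = entry
    @by_guid[guid] << entry.id
    entry
  end

  # Looks an entry up by feed and guid, using the guid index.
  def find(feed_id, guid)
    copies(guid).find { |entry| entry.feed_id == feed_id }
  end

  # Every stored copy of a guid across feeds, e.g. for sharing read status.
  def copies(guid)
    @by_guid[guid].map { |id| @entries[id] }
  end
end
```

With this keying, a splog that publishes a guid first only creates its own row; the legitimate feed’s entry is still stored when that feed is polled.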
