Tuesday, 20 May 2008

Parsing Arbitrary XML Namespaces in Ruby with Hpricot

(This post inspired by a Ruby.MN conversation)

I’ve burned through at least 3 different XML parsing libraries in building Cullect. I started the built-in REXML (pro: built-in, con: heavy and slow) then moved to FeedTools (pro: easily parses the most common feed formats, con: no longer in active development). I have a lot of love for FeedTools, but in the end, it didn’t want to parse XML it didn’t already know about. Then, after a series of far less memorable libraries, I found _why‘s Hpricot (pro: written by _why, con: written by _why).

Hpricot is fast, lean, and doesn’t care about expected tags, namespaces, valid XML, it just makes it easy to get the data and attributes out of the XML.

Here’s an example of how I’m grabbing a feed item’s permalink using Hpricot

doc = Hpricot.XML(feed_contents)
items = (doc/:item)
items.each do |raw_item|
link = raw_item.%('pheedo:origLink') || raw_item.%('feedburner:origLink') || raw_item.%('link')
end

Notice how Hpricot doesn’t require anything special to grab pheedo namespace links or feedburner namespace links in comparison standard links. Just tell it what the tag is you’re looking for. Fast, easy, scalable.