Parsing Arbitrary XML Namespaces in Ruby with Hpricot

(This post was inspired by a Ruby.MN conversation.)

I’ve burned through at least 3 different XML parsing libraries in building Cullect. I started with the built-in REXML (pro: built-in, con: heavy and slow), then moved to FeedTools (pro: easily parses the most common feed formats, con: no longer in active development). I have a lot of love for FeedTools, but in the end, it didn’t want to parse XML it didn’t already know about. Then, after a series of far less memorable libraries, I found _why’s Hpricot (pro: written by _why, con: written by _why).

Hpricot is fast and lean, and it doesn’t care about expected tags, namespaces, or valid XML. It just makes it easy to get the data and attributes out of the XML.

Here’s an example of how I’m grabbing a feed item’s permalink using Hpricot:

    doc = Hpricot.XML(feed_contents)
    items = (doc/:item)
    items.each do |raw_item|
      link = raw_item.%('pheedo:origLink') || raw_item.%('feedburner:origLink') || raw_item.%('link')
    end

Notice how Hpricot doesn’t require anything special to grab pheedo-namespaced or feedburner-namespaced links compared with standard links. Just tell it which tag you’re looking for. Fast, easy, scalable.
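If you don’t have Hpricot handy, the same "try the namespaced tag first, then fall back" idea can be sketched with the built-in REXML. This is not the Hpricot code above, just a rough equivalent for comparison. The sample feed and the helper names permalink_for and permalinks are made up for illustration, and unlike Hpricot, REXML expects the feed to declare any namespace prefix it uses.

```ruby
require 'rexml/document'

# Made-up sample feed: the first item carries a FeedBurner original link,
# the second only has a plain <link>.
FEED = <<~XML
  <rss version="2.0" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
    <channel>
      <item>
        <link>http://feeds.example.com/~r/one</link>
        <feedburner:origLink>http://example.com/one</feedburner:origLink>
      </item>
      <item>
        <link>http://example.com/two</link>
      </item>
    </channel>
  </rss>
XML

# Hypothetical helper: return the text of the first tag in `names` found
# among the item's child elements, mirroring the || chain used with Hpricot.
def permalink_for(item, names = %w[pheedo:origLink feedburner:origLink link])
  children = item.elements.to_a
  names.each do |name|
    # expanded_name is "prefix:name" for prefixed tags, plain "name" otherwise
    found = children.find { |child| child.expanded_name == name }
    return found.text if found
  end
  nil
end

# Hypothetical helper: collect one permalink per <item> in the feed.
def permalinks(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, '//item').map { |item| permalink_for(item) }
end

puts permalinks(FEED)
```

With this sample, the first item should resolve to its feedburner:origLink value and the second should fall through to its plain link.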

11 thoughts on “Parsing Arbitrary XML Namespaces in Ruby with Hpricot”

Robert Fischer says:

May 20, 2008 at 9:00 pm

Is there a way to print out something like the verbatim XML from “raw_item”?

And, on a totally different note, does Cullect support Atom in addition to RSS, or just RSS?
Garrick Van Buren says:

May 20, 2008 at 9:14 pm

@Robert,

1. Is this what you’re looking for?:
    test = "<a><b><c>hello</c></b></a>"
    doc = Hpricot.XML(test)
    t = (doc/:a)
    t.inner_html
    => "<b><c>hello</c></b>"

2. Yes, in addition to RSS, Cullect both parses and generates Atom feeds (it also generates YML, M3U, PLS, JSON feeds).
Robert Fischer says:

May 21, 2008 at 7:03 am

Sweet. You made me a pretty happy camper!
Pingback: Enfranchised Mind » The Status of Ruby’s libxml
Bob Aman says:

May 29, 2008 at 9:07 am

Garrick:

Two things.

One, Cullect’s handling of feed entry ids is broken. Incorrect behavior: Entry stored in database uniquely identified by its id element. Correct behavior: Entries stored in database as children of the parent feed, with read status uniquely identified by the entry’s id element.

The reason what you’ve done is incorrect behavior is simple: I can hijack anyone else’s content trivially. I just have to write some other content with the same entry id, and get it plugged into the database before you poll the legitimate entry.

And second, because I ran afoul of the entry id issue, I was unable to complete my run of the Atom XML namespace conformance tests. Once you fix the first bug, I guess you’re welcome to run them yourself.

http://www.intertwingly.net/wiki/pie/XmlNamespaceConformanceTests
Garrick Van Buren says:

May 29, 2008 at 9:21 am

Bob, thanks. You bring up an interesting point. Could you expand on what you mean by ‘legitimate’ entry?
Bob Aman says:

May 29, 2008 at 11:06 pm

Garrick:

I plugged my own blog into Cullect. Several of the entries had been usurped by a Splog that had republished my content, but had used the same entry ids as the original posts. The splog’s content entered the database before the legitimate entries, so the legitimate content was ignored.

If I wanted to be a total dick, I could write a script that would hammer your feed, and when a new entry was detected, it’d parse out the entry id, generate a new feed containing the same entry id, and say, a goatse image for content, and then immediately subscribe to it in Cullect. Because the initial parse will happen before the next polling of your feed, you never get to see your blog again. Just goatse images, forever and ever. Also, your bandwidth charges would be insane.

All in all, it’s a really, really bad idea. Not to mention the fact that duplicate ids? Yeah, they’re super-common. Sometimes you’ll even get duplicate ids within the same feed. The correct primary key is simply a bog-standard auto-incrementing integer, and then just make sure the guid field has an index on it.
Bob Aman says:

May 29, 2008 at 11:09 pm

Oh, and the rest of my content that wasn’t usurped by the splog? That’s showing up as belonging to several different Planet sites. All of whom correctly republished my content with the same id. But it’s still MY content, not theirs.
Garrick Van Buren says:

May 29, 2008 at 11:21 pm

Bob – thanks. I’m sorry posts in Cullect are being assigned to someone other than yourself. It’s a known bug that will be remedied in an upcoming release.

Update – a couple hours later:
Bob, the entries should now be assigned to their rightful authors. Thanks again – for FeedTools, for stopping by and leaving comments, and for encouraging me to resolve this bug sooner rather than later.
Mark Turner says:

August 28, 2008 at 9:30 am

I’m having some problems parsing those namespace tags with Hpricot. When I try to get the contents (e.g. date = entry.at('gd:when').attributes['startTime']), I get nil returned.

Anyone know why this is?
Todd says:

August 21, 2010 at 11:19 am

Hpricot does care about namespaces… for example

str = %(

barfoo

foobar
)

doc = Hpricot(str)

doc.search("entry > gd:when")

no results…

but change gd:when to when

and

results!