Parsing Arbitrary XML Namespaces in Ruby with Hpricot

20 May 2008 in Hpricot, Programming, Ruby on Rails by Garrick

(This post was inspired by a Ruby.MN conversation.)

I’ve burned through at least three different XML parsing libraries in building Cullect. I started with the built-in REXML (pro: built-in, con: heavy and slow), then moved to FeedTools (pro: easily parses the most common feed formats, con: no longer in active development). I have a lot of love for FeedTools, but in the end, it didn’t want to parse XML it didn’t already know about. Then, after a series of far less memorable libraries, I found _why’s Hpricot (pro: written by _why, con: written by _why).

Hpricot is fast and lean, and it doesn’t care about expected tags, namespaces, or valid XML; it just makes it easy to get the data and attributes out of the XML.

Here’s an example of how I’m grabbing a feed item’s permalink using Hpricot:

doc = Hpricot.XML(feed_contents)
items = (doc/:item)
items.each do |raw_item|
  # Prefer the original link from Pheedo or FeedBurner, then fall back to the plain <link>
  link = raw_item.%('pheedo:origLink') || raw_item.%('feedburner:origLink') || raw_item.%('link')
end

Notice how Hpricot doesn’t require anything special to grab pheedo namespace links or feedburner namespace links compared with standard links. Just tell it which tag you’re looking for. Fast, easy, scalable.
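
The fallback chain above can be pulled into a small helper. This is only a sketch: permalink_element, PERMALINK_TAGS, and the FakeItem stand-in are hypothetical names, not part of Hpricot. The helper assumes only that the item responds to at, which Hpricot elements do (% is Hpricot’s shorthand for at).

```ruby
# Tags to try, most specific first; the list comes from the example in this post.
PERMALINK_TAGS = %w[pheedo:origLink feedburner:origLink link].freeze

# Returns the first matching element, or nil when none of the tags is present.
# `item` is anything that responds to `at(tag_name)`, as Hpricot elements do.
def permalink_element(item, tags = PERMALINK_TAGS)
  tags.each do |tag|
    element = item.at(tag)
    return element if element
  end
  nil
end

# Hypothetical stand-in for a parsed feed item, so the sketch runs
# without Hpricot installed. It maps tag names to plain strings.
class FakeItem
  def initialize(children)
    @children = children
  end

  def at(tag)
    @children[tag]
  end
end

item = FakeItem.new({
  'feedburner:origLink' => 'http://example.com/post',
  'link' => 'http://feeds.example.com/redirect/1'
})
puts permalink_element(item)
```

With a real Hpricot item in place of FakeItem, the return value would be an element rather than a string, so you would still call inner_text (or similar) on it.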

Comments (10)

[...] Hpricot really will handle namespaces as nicely as Garrick says. Otherwise, I’m going to have to rewrite my app in Groovy/Grails and install a bunch of new [...]

Enfranchised Mind » The Status of Ruby’s libxml added these pithy words on May 27 08 at 6:40 am

Is there a way to print out something like the verbatim XML from “raw_item”?

And, on a totally different note, does Cullect support Atom in addition to RSS, or just RSS?

Robert Fischer added these pithy words on May 20 08 at 9:00 pm

@Robert,

1. Is this what you’re looking for?:

test = "<a><b><c>hello</c></b></a>"
doc = Hpricot.XML(test)
t = (doc/:a)
t.inner_html
=> "<b><c>hello</c></b>"

2. Yes, in addition to RSS, Cullect both parses and generates Atom feeds (it also generates YML, M3U, PLS, and JSON feeds).

Garrick Van Buren added these pithy words on May 20 08 at 9:14 pm

Sweet. You made me a pretty happy camper!

Robert Fischer added these pithy words on May 21 08 at 7:03 am

Garrick:

Two things.

One, Cullect’s handling of feed entry ids is broken. Incorrect behavior: Entry stored in database uniquely identified by its id element. Correct behavior: Entries stored in database as children of the parent feed, with read status uniquely identified by the entry’s id element.

The reason what you’ve done is incorrect behavior is simple: I can hijack anyone else’s content trivially. I just have to write some other content with the same entry id, and get it plugged into the database before you poll the legitimate entry.

And second, because I ran afoul of the entry id issue, I was unable to complete my run of the Atom XML namespace conformance tests. Once you fix the first bug, I guess you’re welcome to run them yourself.

http://www.intertwingly.net/wiki/pie/XmlNamespaceConformanceTests

Bob Aman added these pithy words on May 29 08 at 9:07 am

Bob, thanks. You bring up an interesting point. Could you expand on what you mean by ‘legitimate’ entry?

Garrick Van Buren added these pithy words on May 29 08 at 9:21 am

Garrick:

I plugged my own blog into Cullect. Several of the entries had been usurped by a Splog that had republished my content, but had used the same entry ids as the original posts. The splog’s content entered the database before the legitimate entries, so the legitimate content was ignored.

If I wanted to be a total dick, I could write a script that would hammer your feed, and when a new entry was detected, it’d parse out the entry id, generate a new feed containing the same entry id, and say, a goatse image for content, and then immediately subscribe to it in Cullect. Because the initial parse will happen before the next polling of your feed, you never get to see your blog again. Just goatse images, forever and ever. Also, your bandwidth charges would be insane.

All in all, it’s a really, really bad idea. Not to mention the fact that duplicate ids? Yeah, they’re super-common. Sometimes you’ll even get duplicate ids within the same feed. The correct primary key is simply a bog-standard auto-incrementing integer, and then just make sure the guid field has an index on it.
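
A minimal sketch of that layout, using an in-memory store in place of a real database. EntryStore and its methods are hypothetical names for illustration, not Cullect’s actual schema. Each entry gets an auto-incrementing integer key, and the guid is only ever looked up within its parent feed.

```ruby
# Hypothetical in-memory stand-in for an entries table.
class EntryStore
  Entry = Struct.new(:id, :feed_id, :guid, :content)

  def initialize
    @entries = {} # primary key (auto-incrementing integer) => Entry
    @guid_index = Hash.new { |hash, key| hash[key] = [] } # [feed_id, guid] => ids
    @next_id = 1
  end

  # Always inserts a new row; a guid never replaces or blocks another row,
  # even when the same guid shows up twice in one feed.
  def add(feed_id, guid, content)
    entry = Entry.new(@next_id, feed_id, guid, content)
    @entries[entry.id] = entry
    @guid_index[[feed_id, guid]] << entry.id
    @next_id += 1
    entry
  end

  # Looks up entries by guid within a single feed only.
  def find(feed_id, guid)
    @guid_index.fetch([feed_id, guid], []).map { |id| @entries[id] }
  end
end

store = EntryStore.new
guid = 'tag:example.com,2008:post-1'
store.add(:splog, guid, 'copied content')      # arrives first
store.add(:original, guid, 'original content') # same guid, different feed
puts store.find(:original, guid).map(&:content).inspect
```

Because the lookup key includes the feed, the copy that arrived first cannot shadow the original entry.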

Bob Aman added these pithy words on May 29 08 at 11:06 pm

Oh, and the rest of my content that wasn’t usurped by the splog? That’s showing up as belonging to several different Planet sites. All of whom correctly republished my content with the same id. But it’s still MY content, not theirs.

Bob Aman added these pithy words on May 29 08 at 11:09 pm

Bob - thanks. I’m sorry posts in Cullect are being assigned to someone other than yourself. It’s a known bug that will be remedied in an upcoming release.

Update - a couple hours later:
Bob, the entries should now be assigned to their rightful authors. Thanks again - for FeedTools, for stopping by and leaving comments, and for encouraging me to resolve this bug sooner rather than later.

Garrick Van Buren added these pithy words on May 29 08 at 11:21 pm

I’m having some problems parsing those namespace tags with Hpricot. When I try to get the contents (e.g. date = entry.at('gd:when').attributes['startTime']), I get nil returned.

Anyone know why this is?

Mark Turner added these pithy words on Aug 28 08 at 9:30 am
