pydanny: feedparser

Tuesday, April 7, 2009

The end of my Feedfeeder story

Another post about Plone... but this time about me and not about Plone.

For about 18 months I have wrestled with consuming broken RSS feeds to pick up image-of-the-day fields stipulated by customers. These are feeds so broken that no RSS parser, including the masterful Feedparser, can handle them (for example, one image-of-the-day feed usually puts the image in the RSS header and changes it each day, so no history is maintained). They aren't actually RSS; they just have a file name that ends with '.rss'. Plus, the way they are written changes periodically, so custom logic fails.

I forked Reinout van Rees's FeedFeeder project, and even proposed complicated logical revisions to handle these broken feeds and their shifting implementation. I called it Feedfeeder v2. Reinout always seemed hesitant, and I watched as other people extended his work, and despaired. I knew something was wrong but couldn't put my finger on it. I hesitated to work on it, even though funding for it was readily available.

Then between Spacebook, Pinax, and other efforts I shelved this effort for months, hiding my head in the virtual sand. And yet I knew it needed to be addressed. How could I handle something that broke the otherwise wonderful Feedparser?

During PyCon 2009 I came up with the answer. I took an excellent tutorial on HTML scraping and learned lots of little tricks to reinforce my skills with BeautifulSoup. You see, screen scraping is a secret pleasure I have. Scraping out a bit of data from a page is like a little puzzle. While I was talking about this with someone, the answer became clear as day in the middle of the conversation.

The answer was to turn the problem from an RSS interpretation problem into a simple web page scraping puzzle.

Fetch via urllib the XML file that pretends to be RSS.
Parse it using BeautifulSoup or html5lib.
Get all the images listed.
Discard all but the largest image.
Work out the metadata from the XML file and store it with the image.

Problem solved.
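Here is a minimal sketch of the first four steps in Python. Treat it as an illustration, not the eventual Plone package. It swaps the standard library's html.parser in for BeautifulSoup so that it runs with no extra installs, and it assumes that "largest" can be judged from the width and height attributes declared on each img tag. The names ImageCollector, declared_area, largest_image, and fetch are made up for this example.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ImageCollector(HTMLParser):
    """Collect the attributes of every <img> tag, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img/> tags also end up here via handle_startendtag.
        if tag == "img":
            self.images.append(dict(attrs))


def declared_area(img):
    """Width times height as declared on the tag; 0 if missing or not numeric."""
    try:
        return int(img.get("width", 0)) * int(img.get("height", 0))
    except (TypeError, ValueError):
        return 0


def largest_image(markup):
    """Return the attributes of the largest declared image, or None."""
    collector = ImageCollector()
    collector.feed(markup)
    collector.close()
    with_src = [img for img in collector.images if img.get("src")]
    return max(with_src, key=declared_area, default=None)


def fetch(url):
    """Fetch the file that pretends to be RSS and return it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling largest_image(fetch(url)) on one of the fake feeds would give back the attributes of the biggest declared image, or None when no image carries a src. Feeds that don't declare sizes would need the images downloaded and measured instead.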

Now I just need to make a Plone 3 package to do this for me and my angst is finished.

My apologies Reinout for the time spent on trying to cook a solution via Feedfeeder. Thank you for your insights and your extreme patience. I think you tried to tell me to take a different path.

Wednesday, October 22, 2008

Morning brainstorm about FeedFeeder v2

I've been working on a .plan for FeedFeeder v2, but for some reason things were not really coming together. Something seemed off. In retrospect, what was off was that my proposed solution didn't immediately correct the current problem with the otherwise excellent current version of FeedFeeder. And that problem is that any anomalous feeds force you to write and deploy code (i.e., plugins) to correct the anomaly.

Sure, the Van Rees brothers had agreed that a future stage would correct the problem via a TTW function, and we would even consider a handy AJAX powered GUI to make it intuitive. However, the issue with that is that it would occur at a future stage, not at a stage that worked with my current use case - that I get feeds from the customer that they want today to work in nasascience.nasa.gov. Speaking on the financial side, how could I get NASA to pay for work done on FeedFeeder v2 if it doesn't correct our current issues out of the box?

Well, this morning the answer came to me. The solution to the problem was rather clear and simple. Rather than a sophisticated plug-in system, what about a definition system? Currently FeedFeeder provides two content types:

FeedFolder:
- includes a field listing the feeds consumed by this folder
- and is a container for holding feed definitions and feed items

FeedItem:
- individual feed content items provided by the feeds defined in the FeedFolder

My solution proposes adding a third content type called 'FeedDefinition' to handle defining of feeds:

FeedFolder:
- a container for holding feed definitions and feed items

FeedItem:
- individual feed content items provided by the feeds defined in the FeedFolder's FeedDefinitions

FeedDefinition:
- defines the source of a feed and how to handle the feed

A FeedDefinition would likely include the following fields in addition to the defaults:

Source:
- URI of the feed source

FeedTitle:
- default: standard
- otherwise define location of feed title based on FeedParser output

FeedDescription:
- default: standard
- otherwise define location of feed description based on FeedParser output

ItemTitle:
- default: standard
- otherwise define location of item title based on FeedParser output

ItemDescription:
- default: standard
- otherwise define location of item description based on FeedParser output

Replacements:
- default: empty
- lines field that shows what text needs to be replaced with other values.
- example: 'www.nasa.gov -> nasawww-origin1.hq.nasa.gov'
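To make the field list concrete, here is a rough sketch of a FeedDefinition as a plain Python dataclass. This is only a stand-in: in Plone 3 it would be a real content type, and the attribute names and the replacement_pairs and apply_replacements helpers are my assumptions for illustration. The one thing taken straight from the list above is the 'old -> new' format of the Replacements lines.

```python
from dataclasses import dataclass, field


@dataclass
class FeedDefinition:
    """Hypothetical plain-Python stand-in for the proposed content type."""

    source: str                          # URI of the feed source
    feed_title: str = "standard"         # or a location in the FeedParser output
    feed_description: str = "standard"
    item_title: str = "standard"
    item_description: str = "standard"
    replacements: list = field(default_factory=list)  # lines like 'old -> new'

    def replacement_pairs(self):
        """Split each 'old -> new' line into an (old, new) tuple."""
        pairs = []
        for line in self.replacements:
            old, arrow, new = line.partition("->")
            # Skip lines with no arrow or nothing on the left-hand side.
            if arrow and old.strip():
                pairs.append((old.strip(), new.strip()))
        return pairs

    def apply_replacements(self, text):
        """Apply every replacement, in order, to a piece of feed text."""
        for old, new in self.replacement_pairs():
            text = text.replace(old, new)
        return text
```

With the example line from the Replacements field, apply_replacements would rewrite any 'www.nasa.gov' in a feed's text to 'nasawww-origin1.hq.nasa.gov' before the items are stored.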

When a FeedFolder has its update_feed_item action triggered, it would:

Iterate through its FeedDefinition children.
Based on the rules in each FeedDefinition, fetch and parse each feed.
Add the parsed feeds to the FeedFolder as FeedItems.
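A sketch of that loop, with plain dicts standing in for the FeedDefinition children and the resulting FeedItems. The function names and the dotted-path convention for "location in the FeedParser output" are assumptions of mine. The parse argument is any callable that returns a dict with an 'entries' list, which is the shape feedparser.parse() returns; passing it in keeps the sketch runnable without Feedparser or a network connection.

```python
def lookup(data, path):
    """Follow a dotted path such as 'headline.text' through nested dicts."""
    for key in path.split("."):
        data = data[key]
    return data


def update_feed_items(definitions, parse):
    """Return the feed items produced by every definition, in order."""
    items = []
    for definition in definitions:
        parsed = parse(definition["source"])

        # 'standard' means: use the usual Feedparser key.
        # Anything else is treated as a custom dotted path.
        title_path = definition.get("item_title", "standard")
        if title_path == "standard":
            title_path = "title"
        description_path = definition.get("item_description", "standard")
        if description_path == "standard":
            description_path = "summary"

        for entry in parsed["entries"]:
            items.append({
                "source": definition["source"],
                "title": lookup(entry, title_path),
                "description": lookup(entry, description_path),
            })
    return items
```

In real use, parse would be feedparser.parse and each returned dict would become a FeedItem inside the FeedFolder. An anomalous feed then needs only a different path in its definition, not new code.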

A supplementary view for FeedFolder would be provided that would not display the FeedDefinition.

Comments? Thoughts? Anyone think I should move my blog to a place that handles comments better?

Monday, September 15, 2008

Plone OS projects take two: Radius package and FeedFeeder package

I still haven't made up my mind. Let's go over my options, since working on either can be lots of fun. Do keep in mind my target Plone version is 3.x.

Plone Radius Package
This would be really useful for my job. A Plone package that allows authentication via Radius/RSA would likely mean lots more Plone work for NASA HQ. Once I got a functional prototype I'm sure I could get some funding for more work. Since Wichert Akkerman's pyrad Python module is supposedly pure Python, integration should be really easy. I like easy integration.

One thing I like about this potential project is that it should be a pretty quick effort. Actually, the hard part will probably be finding a server to test against.

FeedFeeder Package
Reinout van Rees invited me in a response to this post to take a crack at FeedFeeder when I brought up this issue. The issue?

Feedfeeder assumes the best out of its sources, and assumes that FeedParser is going to return something nice. What if we could make FeedFeeder either assume the worst of its sources, or give FeedFeeder administrators more flexibility in how to handle feeds?

Alas, the problem with RSS (and even Atom) is that people consider the specification (if they actually look at the specification) as mere loose guidelines. I'm not going to point any fingers at anyone because I like my job, but I will say that the ability of Web Browsers to look at anything remotely like RSS and then display the contents like a feed makes life for us Plone developers a pain in the butt.

Periodically, I got people saying, "Include this as a feed!" until I trained them to realize that most RSS feeds are junk. It is nevertheless embarrassing when the so-called feed that displays as an RSS feed in Firefox or Safari is completely screwy when it comes to the XML. In fact, at my job we've pretty much forked FeedFeeder in order to support customer requests, with each RSS feed item being a custom script. The results work and yet are not very pretty.

So my big idea is that FeedFeeder would be enhanced in one of two ways:

Custom Scripts - FeedFeeder administrator can do TTW scripts (portable via Generic Setup) to control how FeedFeeder parses the incoming feed. The scripting would be restricted Python. This way the same feed that can be seen via the browser can now be interpreted by FeedFeeder as well. The problem is the normal sort of issues you get with TTW programming, especially when it comes time to validate the script, or port it around (Even with Generic Setup).
Custom Plugins - How about a plugin system of some sort? Basically, you would follow a standard API and put your plugins in a particular folder. FeedFeeder would pick up the plugins and run the appropriate one (we would have a selector tool) against the appropriate feed. This way we could grow the functionality and robustness of the tool as more RSS and Atom feeds are added, and could also support new protocols as they become popular.
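To give a feel for the plugin idea, here is a small sketch of a registry plus selector. None of this exists in FeedFeeder; the names PLUGINS, feed_plugin, select_plugin, default_handler, and image_of_the_day are invented for the example, as is the example.org URL. It also registers plugins with a decorator and matches them by URL pattern, which is a simplification of dropping files into a particular folder.

```python
import re

# Hypothetical registry of (compiled URL pattern, handler) pairs.
PLUGINS = []


def feed_plugin(url_pattern):
    """Decorator that registers a handler for feeds whose URL matches."""
    def register(handler):
        PLUGINS.append((re.compile(url_pattern), handler))
        return handler
    return register


def default_handler(raw):
    """Fallback for well-behaved feeds; a real one would hand off to FeedParser."""
    return {"handler": "default", "raw": raw}


def select_plugin(url):
    """The selector tool: the first plugin whose pattern matches the URL wins."""
    for pattern, handler in PLUGINS:
        if pattern.search(url):
            return handler
    return default_handler


@feed_plugin(r"example\.org/image-of-the-day")
def image_of_the_day(raw):
    """Example plugin for one anomalous feed."""
    return {"handler": "image_of_the_day", "raw": raw.strip()}
```

The standard API here is just "take the raw feed text, return a dict", so supporting a new troublesome feed means adding one decorated function.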

Idea #1 seems quick to do, yet iffy and chock full of potential surprises, but Idea #2 seems like a solid way to do this effort.

BTW, Reinout van Rees has responded to all my posts on the subject of FeedFeeder, from my first post, which was me whining about not reading the code, to the second, which was deliberating on making an RSS package that was like FeedFeeder but could handle problematic RSS better.

So hopefully Reinout is going to read this post too and share an opinion. ;)

Sunday, April 27, 2008

Feedparser does not work with Google App Engine

After my laughable mistake of trying to do an import feedparser.py, I sat down yesterday and spent half an hour writing my RSS aggregator for Google App Engine. Critical, of course, was use of the excellent feedparser project. It was easy to get everything working, and while not styled it looked good. Everything except for using Feedparser to parse the incoming RSS and Atom feeds.

Alas, Feedparser tries to use a few modules that the enterprising folks at Google restricted. I haven't done any research yet, but I wonder if it is in the arena of fetching data from URLs, since App Engine has its own library for that. I'll poke at it tomorrow.

In any case, I was very pleased with Google App Engine. Let's go over why:

Database is not a RDBMS. Some people might scream at this issue, but the benefits we get are wonderful. Expando seems really fun to use.
Built-in ORM. Sure, it's not SQLAlchemy or the Django system, but it's not that far different in approach and implementation.
Django Templates! If I'm not doing TAL and I'm doing XHTML/XML, then my choice is Django templates.
Cleanly documented. Clear and simple sentences with good examples that are working code, not doc or CLI tests.
Easy and intuitive. This part is critical. The framework is not in the way.

Update: Apparently Feedparser works with GAPE. Either something changed about GAPE (feedparser hasn't been updated in a while) or maybe I had a bug. Thanks to Alex UK and crchemist for pointing this out.

Thursday, April 24, 2008

What I want in a feed aggregator

The list is simple:

One page that displays all the content. Maybe do some pagination, or hide descriptions and just show titles. Otherwise have tags, author, description, and link to original post.
One page with a text area that accepts one feed per line.
Include some sort of authentication.

Ways to get this done
Google App Engine handles #3 for me nicely and gives me free hosting. But feedparser doesn't play well with it and I'm not about to do that kind of debugging. Maybe I ought to try BeautifulSoup?

I'm tempted to try a pure Django system, since that could handle all three, but then I would have to pay for hosting. The same would go for Grok as well. I don't want to pay for hosting yet. Or maybe I ought to just pony up a few bucks a month anyhow...

Of course, I can always write my own simple wxPython client.

What to do... what to do...

Update: Never code on two hours of sleep. I'm going with Google App Engine because I realized that when you import feedparser you can't do this:

import feedparser.py
